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1  INTRODUCTION 


This  report  summarizes  the  work  we  have  performed  for  the  project  entitled  “Real-Time 
Simulation  Technologies  for  Complex  Systems.”  The  objective  of  this  effort  has  been  to 
develop  and  study  three  novel  complementary  directions  that  may  be  summarized  as  follows: 

1.  Speed  up  the  inherently  slow  simulation  process  of  complex  systems  by  exploiting  new 
concurrent  and  parallel  techniques. 

2.  Exploit  the  hierarchical  structure  in  multi-resolution  simulation  models  by  decompos¬ 
ing  them  in  ways  which  preserve  statistical  fidelity. 

3.  Explore  new  ways  to  extract  metamodels  for  complex  systems  from  simulation.  We  have 
been  specifically  targeting  a  radically  different  metamodeling  approach  using  neural 
networks. 

The  scope  of  the  project  has  been  to  develop  specific  methodologies  and  algorithms  and 
test  them  on  benchmark  problems  in  C4I  application  areas.  Thus,  appropriate  simulation 
models  were  built,  and  algorithms  based  on  the  proposed  new  techniques  were  developed  and 
tested.  In  many  cases,  the  benchmark  problems  studied  are  the  same  or  extensions  of  the 
ones  developed  during  our  previous  project  entitled  “Enabling  Technologies  for  Real-Time 
Simulation”  (see  also  [5]). 

As  in  our  last  report  [5],  we  begin  by  briefly  outlining  some  of  the  major  challenges 
faced  by  modeling  and  simulation  techniques  for  complex  systems  and  the  approaches  we 
are  following  to  address  these  challenges  (Section  1.1).  We  then  describe  the  organization  of 
this  report  (Section  1.2). 
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1.1  Issues  in  Modeling  and  Simulation  of  Complex  Systems 


Simulation  is  widely  recognized  as  one  of  the  most  versatile  and  general-purpose  tools  avail¬ 
able  today  for  modeling  complex  processes  and  solving  problems  in  design,  performance 
evaluation,  decision  making,  and  planning.  This  includes  C4I  environments,  where  most 
problems  confronted  by  designers  and  decision  makers  are  of  such  complexity  that  their 
analysis  and  solution  far  surpass  the  scope  of  available  analytical  and  numerical  methods; 
this  leaves  simulation  as  the  only  alternative  of  “universal”  applicability. 

The  importance  of  discrete  event  simulation  has  given  rise  to  a  number  of  commercially 
available  software  packages  (e.g.,  SIMAN,  SLAM,  SIMULA,  SIMSCRIPT,  MODSIM,  EX¬ 
TEND)  whose  applicability  ranges  from  very  generic  to  highly  specialized.  However,  the 
use  of  typical  simulation  software  is  limited  by  factors  such  as  the  following:  (a)  One  must 
have  thorough  knowledge  of  the  specific  tool  at  a  detailed  technical  level  before  attempting 
to  use  it  in  a  modeling  effort,  (b)  One  must  be  an  experienced  programmer,  in  addition 
to  a  decision  maker,  (c)  In  order  to  make  decisions  based  on  simulation,  one  usually  needs 
to  run  a  large  number  of  simulations  and  then  carefully  manage  all  output  data  collected 
on  a  case-by-case  basis,  (d)  The  field  of  simulation  was  developed  primarily  as  a  special 
branch  of  statistics  involving  dynamical  phenomena.  Manual  handling  and  analysis  of  in¬ 
put/output  data  is  still  the  norm,  while  design  of  interfaces,  componentware  interoperability, 
intelligent  and  automated  analysis  of  output  have  been  neglected.  For  example,  the  practice 
of  Object  Oriented  Programming  (OOP),  with  few  exceptions,  is  still  nascent  in  simulation 
languages  despite  the  fact  that  the  OOP  idea  actually  originated  in  simulation.  In  addition, 
hardware  advances,  such  as  massively  parallel  computers  and  workstation  networking,  are 
only  beginning  to  be  noticed  in  simulation  theory  and  practice,  (e)  The  ultimate  purpose  of 
simulation  is  often  system  performance  evaluation  and  optimization.  However,  simulation  is 
notoriously  computer  time-consuming  when  it  comes  to  parametric  studies  of  system  perfor- 
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mance.  Unless  substantial  speedup  of  the  performance  evaluation  process  can  be  achieved, 
systematic  performance  studies  of  most  real-world  problems  are  beyond  reach,  even  with 
supercomputers. 

With  this  brief  discussion  in  mind,  we  identify  below  some  of  the  issues  that  we  believe 
constitute  the  major  challenges  faced  by  simulation  technology  today,  and  introduce  some 
key  ideas  which  have  been  the  subject  of  further  study  in  this  project. 

1.  “What  if”  capability  and  concurrent  simulation.  A  major  goal  of  any  simulator 
is  to  provide  the  capability  to  explore  a  multitude  of  “what  if”  scenaria.  The  obvi¬ 
ous  way  to  obtain  answers  to  N  “what  if”  questions  is  to  perform  (N  +  1)  separate 
simulations:  one  for  some  baseline  scenario  and  N  additional  ones  for  each  “what  if” 
question.  If  a  typical  simulation  run  takes  T  time  units,  this  procedure  requires  a  total 
of  (AT-t-l)T  time  units.  In  concurrent  simulation,  this  objective  is  met  by  performing  a 
single  baseline  simulation,  but  endowing  it  with  the  capability  to  generate  all  (N  +  1) 
desired  simulations  concurrently.  This  is  accomplished  by  exploiting  an  intelligent 
sharing  of  data  which  results  in  a  total  simulation  time  of  (T  +  c)  «  (N  + 1  )T ,  where 
c  represents  the  overhead  corresponding  to  this  “intelligent  data  sharing” .  Concurrent 
simulation  can  be  carried  out  on  any  sequential  computer.  The  universal  applicability 
of  this  approach  has  been  the  subject  of  our  work  during  the  previous  and  current 
projects.  Although  this  issue  has  been  resolved  through  the  development  of  the  Time 
Warping  Algorithm  (TWA)  (see  [7]),  one  of  the  findings  of  this  project  is  that  its  com¬ 
putational  efficiency  significantly  depends  on  how  it  is  implememented  on  any  given 
computer  platform. 

2.  Hierarchical  simulation  and  statistical  fidelity.  In  modeling  complex  systems  it  is 
impossible  to  mimic  every  detail  through  detailed  simulation.  The  common  approach 
is  to  divide  the  whole  system  hierarchically  into  modules  with  different  simulation 
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resolutions.  The  lower,  high  resolution,  level  simulator  generates  reports  which  are 
then  taken  as  inputs  for  the  higher  level  simulator.  Current  practice  is  to  use  the  mean 
values  of  variables  from  the  lower  level  reports  as  the  input  to  the  higher  level.  This 
implies  that  significant  statistical  information  (i.e.,  statistical  fidelity)  is  lost  in  this 
process,  resulting  in  potentially  completely  inaccurate  results.  Especially  when  the 
ultimate  output  of  the  simulation  process  is  of  the  form  0  or  1  (e.g.,  “lose”  or  “win”  a 
combat),  such  errors  can  provide  the  exact  opposite  of  the  real  output.  Our  effort  has 
been  directed  at  developing  an  interface  between  the  two  simulation  levels  to  preserve 
the  statistics  to  the  maximum  extent  that  the  available  computing  power  allows.  In 
our  previous  project  [5]  we  initiated  a  study  of  an  approach  based  on  clustering  or  path 
bundle  grouping  which  was  further  pursued  in  this  project. 

3.  Metamodeling  through  Neural  Networks.  The  main  idea  of  metamodeling  is 
to  extract  as  much  information  from  simulation  as  possible  and  process  it  so  as  to 
build  a  surrogate  model  of  the  system  of  interest  which  is  much  simpler  (yet  accurate) 
to  work  with.  This  is  essentially  analogous  to  constructing  a  function  F(x i, . . .  ,xjsr) 
from  only  selected  values  observed  under  selected  combinations  of  values  of  . . . ,  xjy. 
The  problem,  of  course,  is  that  the  actual  function  we  are  trying  to  approximate 
with  F(xi, . . . ,  ajjv)  is  unknown.  The  most  common  approach  is  to  try  and  build  a 
polynomial  expression.  This  is  often  inadequate  because  if  the  shape  of  the  actual 
curve  corresponding  to  F(xi, . . . ,  x^)  includes  sudden  jumps  and  asymptotic  behavior 
(which  is  very  often  the  case  from  experience),  then  polynomial  fits  to  such  curves  are 
known  to  be  poor.  Thus,  obtaining  a  metamodeling  device  of  “universal”  applicability, 
i.e.,  one  capable  of  generating  functions  of  virtually  arbitrary  complexity,  remains 
an  open  issue.  This  project  has  explored  neural  networks  as  offering  this  capability, 
including  some  benchmark  models  and  problems  that  have  been  analyzed  with  these 
techniques. 


4 


1.2  Report  Organization 


The  contents  of  this  report  may  be  outlined  as  follows. 

•  Section  2:  The  basic  theoretical  framework  for  explaining  the  concurrent  simulation 
algorithms  we  have  developed  is  first  reviewed.  Based  on  this  framework,  a  general 
concurrent  simulation  approach  was  developed  under  our  previous  project,  where  a  de¬ 
tailed  Time  Warping  Algorithm  (TWA)  was  also  introduced.  The  concept  of  “speedup” 
was  used  in  order  to  provide  a  clear  quantitative  measure  of  the  improvement  provided 
by  concurrent  simulation  over  conventional  simulation  techniques.  This  report  includes 
the  following  new  findings  from  our  project:  (a)  extensions  of  the  TWA  that  enhance  its 
range  of  applications,  (b)  further  investigations  based  on  the  speedup  factor  and  some 
explicit  numerical  results,  and  (c)  a  study  of  the  statistical  significance  of  estimates 
obtained  through  the  TWA.  In  addition,  Section  2.7  describes  the  use  of  Concurrent 
Simulation  in  complex  resource  allocation  problems  requiring  answers  to  a  large  num¬ 
ber  of  ‘what  if”  questions  in  a  near-real-time  setting  to  support  decision  making  in  a 

C4/  setting. 

•  Section  3  The  stochastic  fidelity  preservation  issue  in  multi-resolution  models  of  com¬ 
plex  systems  is  discussed  in  detail.  The  Path  Bundle  Grouping  (PBG)  approach  is 
described  as  a  way  to  maintain  stochastic  fidelity  in  hierarchical  simulations.  The 
PBG  approach  is  related  to  the  theory  of  cluster  analysis.  We  have  reviewed  different 
aspects  of  this  theory  in  order  to  compare  them  to  the  Adaptive  Resonance  Theory 
(ART)  neural  network  we  have  developed  for  the  purpose  of  automating  the  task  of 
path  bundle  grouping.  We  also  report  on  an  interesting  “real-world”  application  we 
encountered  in  the  course  of  this  project. 

•  Section  4  We  review  the  main  concepts  involved  in  using  neural  networks  as  universal 
function  approximators.  A  particular  type  of  neural  network,  the  Cascade  Correlation 
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Neural  Network  (CCNN)  was  developed  for  the  purposes  of  this  project  for  its  ability 
to  build  itself  to  an  appropriate  size  as  part  of  its  learning  process  during  the  training 
phase.  This  was  used  in  [5]  on  a  testbed  model  using  the  Tactical  Electronic  Re¬ 
connaissance  Simulator( TERSM).  This  section  includes  several  numerical  results  and 
comparisons  with  polynomial  metamodels  previously  derived,  as  a  continuation  of  a 
study  initiated  earlier  and  presented  in  [5].  A  new  benchmark  problem  is  studied  based 
on  an  Aircraft  Refueling  and  Maintenance  System  (ARMS)  model. 

•  Section  5  We  present  the  main  conclusions  of  our  study,  including  lessons  learned 
and  recommendations.  We  also  outline  our  ongoing  work  and  some  future  research 
directions. 


2  CONCURRENT  SIMULATION 

A  major  objective  of  this  project  is  motivated  by  the  time-consuming  nature  of  system 
performance  exploration  through  simulation:  to  obtain  answers  to  N  “what  if”  questions, 
( N  +  1)  simulations  are  needed.  Therefore,  our  goal  is  the  following:  From  a  single  simu¬ 
lation ,  obtain  answers  to  all  N  “what  if”  questions  simultaneously.  The  main  idea  behind 
the  approach  we  used  to  solve  this  problem  is  to  observe  the  evolution  of  a  single  sample 
path  of  the  Discrete  Event  System  (DES)  under  study,  called  the  nominal  sample  path,  as  it 
operates  under  some  parameter.  As  the  sample  path  evolves,  observed  data  (e.g.,  event  oc¬ 
currences  and  their  corresponding  occurrence  times)  are  processed  to  concurrently  construct 
the  set  of  sample  paths  that  would  have  resulted  if  the  system  had  operated  under  a  set 
of  different  (hypothetical)  parameters.  Using  these  “concurrently  constructed”  hypotheti¬ 
cal  sample  paths,  it  is  possible  to  “concurrently  estimate”  the  corresponding  performance 
measures  which  can  be  used  in  the  design  of  the  actual  system. 

In  this  section  we  review  the  principles  of  concurrent  simulation  for  solving  the  fundamen- 
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tal  “sample  path  constructability  problem”  (Section  2.1).  We  use  the  concept  of  a  stochastic 
timed  state  automaton  as  a  modeling  framework  for  general  DES,  allowing  us  to  describe  a 
procedure  for  constructing  sample  paths  of  DES,  a  formal  way  of  characterizing  the  function 
of  any  discrete  event  simulation  software  package  (Section  2.2).  In  Section  2.3  we  summarize 
the  concurrent  simulation  method  developed  in  our  previous  project  [5]  for  solving  the  sam¬ 
ple  path  constructability  problem,  culminating  with  the  Time  Warping  Algorithm  (TWA). 
In  Section  2.4  we  summarize  extensions  that  we  were  able  to  obtain  during  the  course  of  this 
project.  A  significant  part  of  our  effort  was  also  directed  at  investigating  the  computational 
efficiency  of  the  TWA  and  our  findings  are  summarized  in  Section  2.4.  In  Section  2.5,  we 
also  address  the  issue  of  statistical  significance  of  estimates  obtained  through  the  TWA.  In 
an  Appendix  to  this  Final  Report  we  have  included  a  copy  of  a  recently  published  paper 
entitled  “Concurrent  Sample  Path  Analysis  of  Discrete  Event  Systems  ,  which  presents  the 
“concurrent  simulation”  algorithm  and  its  analysis  in  more  detail. 


2.1  The  Sample  Path  Constructability  Problem 

We  adopt  the  modeling  framework  of  a  stochastic  timed  state  automaton  ( £,X ,T,f,x 0)  [4]  to 
characterize  the  function  of  any  discrete  event  simulator.  Here,  £  is  a  countable  event  set, 
A'  is  a  countable  state  space,  and  T(x)  is  a  set  of  feasible  (or  enabled)  events,  defined  for  all 
x  e  X  such  that  T(x)  C  £.  The  state  transition  function  /(x,e)  is  defined  for  all  x  G  X, 
e  <E  T(x),  and  specifies  the  next  state  resulting  when  e  occurs  at  state  x.  Finally,  x0  is  a  given 
initial  state.  The  definition  is  easily  modified  to  (£,X,r,p,p0)  in  order  to  include  probabilistic 
state  transition  mechanisms.  In  this  case,  the  state  transition  probability  p(x  ,  x,  e  )  is  defined 
for  all  x,x'  €X,e'e  £,  and  is  such  that  p(x';x,e')  =  0  for  all  e'  £  T(x).  In  addition,  p0(x) 
is  the  pmf  P[®o  =  x],  x  €  X,  of  the  initial  state  Xo- 

Assuming  the  cardinality  of  the  event  set  £  is  N,  the  input  to  the  system  is  a  set  of 
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Figure  1:  The  sample  path  constructability  problem  for  DES 

event  lifetime  sequences  {Vj,  •  •  • ,  Vjv},  one  for  each  event,  where  V,  =  {u,(l),  u;(2),  •  •  •}  is 
characterized  by  some  arbitrary  distribution.  Under  some  system  parameter  90,  the  output 
is  a  sequence  £(0O)  =  k  =  1,2,  •••}  where  e*  €  £  is  the  kth  event  and  tk  is  its 

corresponding  occurrence  time  (see  Figure  1).  Based  on  any  observed  £(0O),  we  can  evaluate 
mOo)},  a  sample  performance  metric  for  the  system.  For  a  large  family  of  performance 
metrics  of  the  form  J(60)  =  J5[L[£(0o)]]>  T[£(0O)]  is  therefore  an  estimate  of  J(60).  Defining  a 
set  of  parameter  values  of  interest  {$o,  $i5  •  •  • ,  Om}i  the  sample  path  constructability  problem 
is: 


For  a  DES  under  0o,  construct  all  sample  paths  £(0i),  •  •  • ,  £(#m)  given  a  realiza¬ 
tion  of  lifetime  sequences  Vi,-*-,Vjv  and  the  sample  path  £(0O)- 

For  simplicity,  we  assume  that  the  system  we  are  modeling  satisfies  the  following  three 
assumptions.  Extensions  allowing  the  relaxation  of  these  assumptions  are  possible  and  are 
described  in  [7]  (see  Appendix). 

•  (Al)  Feasibility  Assumption :  Let  xn  be  the  state  of  the  DES  after  the  occurrence  of 
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the  nth  event.  Then,  for  any  n,  there  exists  at  least  one  r  >  n  such  that  e  G  T(a:r)  for 
any  e  G  £■ 

•  (A  2)  Invariability  Assumption:  Let  £  be  the  event  set  under  the  nominal  parameter  0o 
and  let  £m  be  the  event  set  under  0m  ^  @o-  Then,  £m  =  £. 

•  ( AS )  Similarity  Assumption:  Let  Gi(9o),  i  G  £  be  the  event  lifetime  distribution  for  the 
event  i  under  0O  and  let  Gt(0m),  i  €  £  be  the  corresponding  event  lifetime  distribution 
under  0m.  Then,  G,(0o)  =  G;(0m)  for  all  i  G  £■ 

Assumption  A1  guarantees  that  in  the  evolution  of  any  sample  path  all  events  in  £  will 
always  become  feasible  at  some  point  in  the  future.  Note  that  a  DES  with  an  irreducible  state 
space  immediately  satisfies  this  condition.  Assumption  A2  states  that  changing  a  parameter 
from  0O  to  some  0m  /  0o  does  not  alter  the  event  set  £.  More  importantly,  A 2  guarantees 
that  changing  to  0m  does  not  introduce  any  new  events  so  that  all  event  lifetimes  for  all 
events  can  be  observed  from  the  nominal  sample  path.  Finally,  assumption  A3  guarantees 
that  changing  a  parameter  from  0Q  to  some  9m  ^  90  does  not  affect  the  distribution  of  one 
or  more  event  lifetime  sequences.  This  allows  us  to  use  exactly  the  same  lifetimes  that  we 
observe  in  the  nominal  sample  path  to  construct  the  perturbed  sample  path.  In  other  words, 
our  analysis  focuses  on  structural  system  parameters  rather  that  distributional  parameters. 
However,  it  is  possible  to  handle  the  latter  at  the  expense  of  some  computational  cost,  as 
described  in  Section  2.4. 

2.2  Sample  Path  Construction  Procedure 

First,  let  £(n,  9)  =  {ej  :  j  -  1,  •  •  • ,  n},  with  ej  G  £,  be  the  sequence  of  events  that  constitute 
the  observed  sample  path  up  to  n  total  events.  Although  f(n,0)  is  clearly  a  function  of  the 
parameter  0,  we  will  write  f(n)  to  refer  to  the  observed  sample  path.  Next  we  define  the 
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score  of  an  event  i  e  £  in  a  sequence  £(n),  denoted  by  sf  =  [£(n)],-,  to  be  the  non-negative 
integer  that  counts  the  number  of  instances  of  event  i  in  this  sequence. 

We  introduce  two  additional  variables,  tn  to  be  the  time  when  the  nth  event  occurs,  and 
Vi{n ),  i  €  r(a:n),  to  be  the  residual  lifetime  of  event  i  after  the  occurrence  of  the  nth  event 
(i.e.,  it  is  the  time  left  until  event  i  occurs).  On  a  particular  sample  path,  just  after  the  nth 
event  occurs  the  following  information  is  known:  the  state  xn  from  which  we  can  determine 
r(x„),  the  time  tn ,  the  residual  lifetimes  yi(n)  for  all  i  €  r(xn),  and  all  event  scores  s“,  i  £  £. 
The  following  equations  describe  the  dynamics  of  the  timed  state  automaton. 


step  1:  Determine  the  smallest  residual  lifetime  among  all  feasible  events  at  state  xn ,  denoted 

by  y*- 

y*n  =  min{ys(n)}  (1) 

step  2:  Determine  the  triggering  event: 

en+1  =  arg  .min^y^n)}  (2) 


step  3:  Determine  the  next  state: 

•En+l  —  ^n+l)  (3) 

step  4-'  Determine  the  next  event  time: 

tn+i  =  tn  +  yn  (4) 


step  5:  Determine  the  new  residual  lifetimes  for  all  new  feasible  events  i  6  r(x„+1): 


yi(n  +  1) 


yi(n )  -  yl 

Vi(s?  +  !) 


if  i  ^  en+i  and  i  G  r(a:n) 
if  i  =  en+1  or  i  £  T(a:n) 


for  all  i  e  r(xn+1) 


(5) 
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step  6:  Update  the  event  scores: 


sn+i  =  f  +  1  if  *  =  e"+i  (6) 

1  1  s"  otherwise 

Equations  (l)-(6)  describe  the  sample  path  evolution  of  a  timed  state  automaton.  These 
equations  apply  to  both  the  observed  and  any  constructed  sample  paths  through  concurrent 
simulation. 

2.3  The  Time  Warping  Algorithm  (TWA) 

The  Time  Warping  Algorithm  is  a  specific  procedure  for  accomplishing  concurrent  simula¬ 
tion.  It  was  introduced  and  described  in  our  previous  project  [5],  so  that  we  limit  ourselves 
here  to  a  brief  review. 

We  begin  by  summarizing  the  necessary  notation.  We  use  £(k)  =  {tj  :  j  =  1,  •  •  • ,  k}  to 
denote  any  constructed  sample  path  under  a  different  value  of  the  parameter  0,  where  k  is 
the  number  of  events  in  that  path.  It  is  important  to  realize  that  k  is  actually  a  function 
of  ?t. ,  the  number  of  event  observed  on  the  nominal  path.  This  is  because  the  constructed 
sample  path  is  coupled  to  the  observed  sample  path  through  the  observed  event  lifetimes, 
however,  for  the  sake  of  notational  simplicity,  we  refrain  from  continuously  indicating  this 
dependence.  The  score  of  event  i  in  a  constructed  sample  path  is  denoted  by  s’-  =  [£(&)]»•  In 
what  follows,  all  quantities  with  the  symbol  “  t  ”  refer  to  a  typical  constructed  sample  path. 

Associated  with  every  event  type  i  G  £  in  £{n)  is  a  sequence  of  s”  event  lifetimes 

Vt-(n)  =  Ml),  •  •  • ,  »<(«")}  for  all  i  €  £ 

The  corresponding  set  of  sequences  in  the  constructed  sample  path  is. 

V.-(fc)  =  Wl),-,»i(i})}  for  all  i  €  £ 
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which  is  a  subsequence  of  Vt(n)  with  k  <  n.  In  addition,  we  define  the  following  sequence 
of  lifetimes: 

V;(n,  k )  =  +  !),-••,  for  all  i  £  £ 

which  consists  of  all  event  lifetimes  that  are  in  Vt-(n)  but  not  in  V,(fc). 

Next,  define  the  set 

A(n,  k)  =  [i:ie  £,  s"  >  sf }  (7) 

which  is  associated  with  Vt  (n,fc)  and  consists  of  all  events  i  whose  corresponding  sequence 
V;(n,  k)  contains  at  least  one  element.  Thus,  every  i  €  A(n,k)  is  an  event  that  has  been 
observed  in  £(n)  and  has  at  least  one  lifetime  that  has  yet  to  be  used  in  the  coupled  sample 

path  C(k).  Hence,  A(n,  k)  should  be  thought  of  as  the  set  of  available  events  to  be  used  in 
the  construction  of  the  coupled  path. 

Finally,  we  define  the  following  set,  which  is  crucial  in  our  approach: 

M(n,k)  =  T(xk)-(T(xk.1)-{ek})  (8) 

where,  clearly,  M(n,  k)  C  £.  Note  that  is  the  triggering  event  at  the  (k  —  l)th  state  visited 
in  the  constructed  sample  path.  Thus,  M(n,  k )  contains  all  the  events  that  are  in  the  feasible 
event  set  V(xk)  but  not  in  r(xfc_i);  in  addition,  also  belongs  to  M(n,k )  if  it  happens  that 
ek  £  r(xfc).  Intuitively,  M (n,  k)  consists  of  all  missing  events  from  the  perspective  of  the 
constructed  sample  path  when  it  enters  a  new  state  Xk :  those  events  already  in  T(xk~i)  which 
were  not  the  triggering  event  remain  available  to  be  used  in  the  sample  path  construction 
as  long  as  they  are  still  feasible;  all  other  events  in  the  set  are  “missing”  as  far  as  residual 
lifetime  information  is  concerned. 

With  this  notation,  we  now  present  the  Time  Warping  Algorithm  (TWA). 
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Time  Warping  Algorithm  (TWA): 


1.  INITIALIZE 

n  :=  0,  k  :=  0,  tn  :=  0,  tk  :=  0,  xn  :=  ®0,  =  ®o, 

yt(n)  =  u,(l)  for  all  i  £  r(s«),  s?  =  0,5*  =  0  for  all  i  €  5, 

M(0,0)  :=  r(x0),  A(0,0)  :=  0 


2.  WHEN  EVENT  en  IS  OBSERVED: 


2.1  Use  (l)-(6)  to  determine  en+i,  x„+i,  fn+i,  +  1)  f°r  *  £  r(xn+i),  <s,+  for 
all  i  £  £. 

2.2  Add  the  en+i  event  lifetime  to  V2(n  +  1,  fc): 


V<(n  +  l,fc) 


V{(n,  fc)  +  ut-(s”  +  1)  if  i  =  en+x 
Vi(n,k)  otherwise 


2.3  Update  the  available  event  set  A(n,  k):  A(n  +  1,  k)  =  A(n,  A:)  U  {en+i} 

2.4  Update  the  missing  event  set  M(n,k ):  M(n  +  1,  k)  =  M(n,k) 

2.5  IF  M(n  +  1,  A:)  C  A(n  +  l,k)  then  Goto  3.  ELSE  set  n  <—  n  +  1  and  Goto  2.1. 


3.  TIME  WARPING  OPERATION: 

3.1  Obtain  all  missing  event  lifetimes  to  resume  sample  path  construction  at  state 


f  +  1)  for  i  £  M(n  +  1,  k) 
1  yi(k  —  1)  otherwise 


3.2  Use  (l)-(6)  to  determine  ek+i ,  xk+u  4+ i,  Vi{k  +  1)  for  all  i  €  r(xfc+i)  H  (r(xfc) 
{efc+i}),  sf+1  for  all  i  £  S. 

3.3  Discard  all  used  event  lifetimes: 

V;(n  +  1,  k  +  1)  =  Vj(n  +  1,  A:)  -  +  1)  for  all  i  £  M(n  +  1,  k) 
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3.4  Update  the  available  event  set  A(n  +  1  ,k): 


A(n  +  1,  k  +  1)  =  A(n  +  1,  k)  -  {*  :  i  <E  M(n  +  1,  k),  sf+1  =  sf+1} 

3.5  Update  the  missing  event  set  M(n  +  1,  k ): 

M{n  +  1,  k  +  1)  =  T(xk+i)  -  (r(ifc)  -  {efc+1}) 

3.6  IF  M(n  +  l,fc  +  1)  C  A(n  +  l,k  +  1)  then  k  f-  k  +  1  and  Goto  3.1.  ELSE 
fcf-fc  +  l,u<-n  +  l  and  Goto  2.1. 

The  computational  requirements  of  TWA  are  minimal  (adding  and  subtracting  elements 
to  sequences,  simple  arithmetic,  and  checking  the  subset  condition  in  steps  2.5  and  3.6 
above.  Rather,  it  is  the  storage  of  additional  information  that  constitutes  the  major  cost  of 
the  algorithm. 

2.4  Extensions  of  the  TWA 

In  section  2.1  we  stated  three  assumptions  that  were  made  to  simplify  the  development  of 
our  approach  and  keep  the  TWA  notationally  simple.  It  turns  out  that  we  can  extend  the 
application  of  TWA  by  relaxing  these  assumptions  at  the  expense  of  some  extra  work. 

In  A  2  we  assumed  that  changing  a  parameter  from  90  to  some  6m  /  60  does  not  alter 
the  event  set  £.  Clearly,  if  the  new  event  set,  £m  is  such  that  £m  C  £,  the  development  and 
analysis  of  TWA  is  not  affected.  If,  on  the  other  hand,  £  C  £m,  this  implies  that  events 
required  to  cause  state  transitions  under  9m  are  unavailable  in  the  observed  sample  path, 
which  make  the  application  of  our  algorithm  impossible.  In  this  case,  one  can  introduce 
phantom  event  sources  which  generate  all  the  unavailable  events,  provided  that  the  lifetime 
distributions  of  these  events  are  known.  The  idea  of  phantom  sources  can  also  be  applied 
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to  DES  that  do  not  satisfy  Al.  In  this  case,  if  a  sample  path  remains  suspended  for  a  long 
period  of  time,  then  a  phantom  source  can  provide  the  required  event(s)  so  that  the  sample 
path  construction  can  resume. 

In  A  3  we  assumed  that  changing  a  parameter  from  90  to  some  9m  ^  90  does  not  affect 
the  distribution  of  one  or  more  event  lifetime  sequences.  This  assumption  is  used  in  step  2.2 
of  the  TWA  where  the  observed  lifetime  u,(sf  +  1)  is  directly  suffix-added  to  the  sequence 
V;(n+1,  k).  Note  that  this  problem  can  be  overcome  by  transforming  observed  lifetimes  Vi  = 
{uj(l),  Uj(2),  •  •  •}  with  an  underlying  distribution  Gi(90)  into  samples  of  a  similar  sequence 

corresponding  to  the  new  distribution  Gi(9m)  and  then  suffix-add  them  in  V;(n+1,  k).  This  is 
indeed  possible,  if  Gi{90),  Gi{9m)  are  known,  at  the  expense  of  some  additional  computational 
cost  for  this  transformation  (for  example,  see  [4]).  One  interesting  special  case  arises  when 
the  parameter  of  interest  is  a  scale  parameter  of  some  event  lifetime  distribution  (e.g.,  it  is 
the  mean  of  a  distribution  in  the  Erlang  family).  Then,  simple  rescaling  suffices  to  transform 
an  observed  lifetime  v,  under  90  into  a  new  lifetime  v{  under  9m: 

Vi  =  (9m  /  9q)v{ 

Finally  note  that  in  a  simulation  environment  it  is  possible  to  eliminate  the  overhead 
which  is  due  to  checking  the  subset  condition  in  step  2.5.  In  order  to  achieve  this  we  need 
to  eliminate  the  coupling  between  the  observed  and  constructed  sample  paths.  Towards 
this  goal,  we  can  simulate  the  nominal  sample  but  rather  than  disposing  the  event  lifetimes 
we  save  them  all  in  memory.  Once  the  simulation  is  done,  we  simulate  one  by  one  all  the 
perturbed  sample  paths  exactly  as  we  do  with  a  “brute  force”  simulation  scheme  but  rather 
than  generating  the  required  random  variates  we  read  them  directly  from  the  computer 
memory.  In  this  way  we  trade  off  computer  memory  for  higher  speedup.  A  quantification  of 
this  tradeoff  is  the  subject  of  ongoing  research. 
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2.5  Comparative  Speedup  Analysis 


To  define  the  speedup  factor  associated  with  concurrent  simulation,  suppose  that  the  sam¬ 
ple  path  constructed  through  our  coupling  approach  were  instead  generated  by  a  separate 
simulation  whose  length  is  defined  by  N  total  events.  Let  T/v  be  the  time  it  takes  (in  CPU 
time  units)  to  complete  such  a  simulation  run.  Further,  suppose  that  when  the  nominal 
simulation  is  executed  with  TWA  as  part  of  it,  the  total  time  is  given  by  Th  +  tk,  where 
Tfj  is  the  simulation  time  without  the  TWA  and  tk  is  the  additional  time  involved  in  the 
concurrent  construction  of  a  sample  path  with  K  <  N  events.  We  then  define  the  speedup 
factor  due  to  TWA  as 


c  = 

tk/I< 


(9) 


Thus,  if  a  separate  simulation  (in  addition  to  the  one  for  the  observed  sample  path)  were 
to  be  used  to  generate  a  sample  path  under  a  new  value  of  the  parameter  of  interest,  the 
computation  time  per  event  is  Tn/N.  If,  instead,  we  use  the  TWA  in  conjunction  with  the 
observed  path,  no  such  separate  simulation  is  necessary,  but  the  additional  time  per  event 
imposed  by  the  approach  is  tk/K ,  where  K  <  N  in  general. 


A  number  of  simulation  results  that  include  speedup  computations  relative  to  “brute 
force”  simulation  (i.e.,  separately  simulating  N  parameter  settings)  were  included  in  [5]. 
Another  interesting  issue  we  have  addressed  is  that  of  comparing  the  speedup  performance 
of  the  TWA  to  two  other  known  methods  for  concurrent  simulation,  Augmented  System 
Analysis  ASA  and  the  Standard  Clock  method  SC,  both  of  which  are  limited  to  Markovian 
event  processes.  As  expected,  both  techniques  can  achieve  higher  speedup  than  the  TWA, 
as  indicated  in  Figure  2  for  an  M/M/l/K  system  studied  over  a  range  of  values  for  K. 
ASA  can  achieve  a  speedup  of  up  to  30,  considerably  higher  than  both  SC  and  TWA, 
however,  it  is  only  applicable  to  systems  with  exponential  event  lifetime  distributions  (with 
the  possibility  of  one  non-Markovian  event  process,  as  noted  in  section  1).  SC  can  achieve 
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a  speedup  of  up  to  8,  however  it  is  also  limited  to  exponential  event  lifetime  distributions 
(unless  approximations  are  used  for  systems  with  general  distributions).  In  Figure  2,  the 
speedup  for  the  TWA  turns  out  to  be  in  the  vicinity  of  2. 

Finally,  note  that  from  the  definition  of  speedup  (9),  one  would  expect  that  it  is  not  a 
function  of  the  number  of  concurrently  constructed  sample  paths.  Intuitively,  suppose  that 
one  is  interested  in  concurrently  constructing  M  sample  paths.  Therefore,  using  brute  force 
it  would  take  MTjv  time  units  to  generate  M N  events,  while  using  any  other  constructability 
technique  it  would  require  Mtj<  time  units  to  construct  MK  events.  Hence,  the  speedup 
factor  should  be  independent  of  M.  However,  the  number  of  constructed  events  depends  on 
the  parameter  settings  as  well.  For  this  reason,  the  number  of  constructed  events  is  given 
by  MK  -  K0  (not  MK).  Therefore, 

MTn/MN  _Tn  (K  Kq  \ 
s  ~  Mtk/(MI<  -  Kq)  ~  tk\N  NMJ 

which  implies  that  it  approaches  a  constant  as  M  becomes  larger. 

2.6  Statistical  Significance  of  Estimates  obtained  through  TWA 

As  indicated  earlier,  the  constructed  sample  path  may  remain  suspended  for  extended  periods 
of  time  while  waiting  for  one  or  more  of  the  missing  events.  This  in  turn,  implies  that  while 
the  length  of  the  observed  sample  path  (N)  is  long  enough  to  guarantee  that  the  observed 
measures  are  statistically  significant,  the  length  of  the  constructed  sample  path  ( K )  may 
not  become  long  enough  to  provide  such  accuracy.  Figure  3  shows  the  ratio  ( K/N )  for 
an  M/M/l/K  queueing  system  when  there  are  four  classes  of  customers  and  the  observed 
sample  path  has  five  buffer  slots  (i.e.,  K  =  5).  First  note  that  when  all  events  occur  with 
similar  frequency,  the  K/N  ratio  converges  quickly  ( M/M/1/7  curve),  whereas,  when  there 
is  a  rare  event  (class  4  arrival)  often  the  constructed  sample  path  is  forced  to  wait  for  long 
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Figure  2:  Speedup  of  ASA,  SC  and  TWA,  for  an  M/M/1/ K  system. 

intervals  until  the  rare  event  is  observed  which  causes  the  K/N  ratio  to  become  small. 
Once  the  missing  event  is  observed,  a  large  number  of  events  may  be  immediately  processed 
allowing  K/N  to  increase  again.  This  results  in  the  initial  large  oscillations  observed  in 
Figure  3  which  eventually  diminish  as  N  grows  larger. 

In  addition,  note  that  the  parameter  setting  also  affect  the  K/N  ratio.  In  the  case  of  a 
single  buffer  slot  (M/M/1/1),  the  blocking  probability  is  much  larger  than  in  the  observed 
sample  path,  therefore,  several  observed  departure  events  are  not  constructed  because  the 
corresponding  customer  was  lost.  For  this  reason,  K/N  converges  to  a  value  less  than  one 
(in  this  case  0.85).  On  the  other  hand,  when  the  constructed  sample  path  has  nine  slots, 
the  observed  and  constructed  sample  paths  have  comparable  blocking  probabilities  therefore 
most  of  the  observed  event  are  also  constructed,  so  the  K/N  ratio  is  closer  to  one. 
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Figure  3:  K/N  ratio  for  system  1. 

2.7  Resource  Allocation  and  Concurrent  Simulation 

Resource  allocation  is  a  basic  problem  one  encounters  in  numerous  aspects  of  C4I  systems 
(e.g.,  allocating  ammunition  or  platforms  to  different  missions,  sensors  to  different  platforms, 
communication  bandwidth  to  battle  space  entities) .  The  mathematical  representation  of  such 
resource  allocation  problems  is  as  follows.  Let  r  €  be  the  decision  vector  or  “state”.  In 
general,  there  is  a  set  of  feasible  states  denoted  by  Ad  such  that  r  (E  Ad  represents  a  constraint. 
For  example,  in  a  typical  resource  allocation  problem,  r,-  denotes  the  number  of  resources 
that  user  i  is  assigned  subject  to  a  capacity  constraint  of  the  form  Ad  ^0* 

In  a  stochastic  setting,  let  Ld(r,u) )  be  the  cost  incurred  over  a  specific  sample  path  (denoted 
by  lo)  and  Jd{r )  be  the  expected  cost  of  a  system  operating  under  r.  Then,  the  discrete 
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optimization  problem  we  are  interested  in  is  the  determination  of  r*  €  Ad  such  that 

Jd(r*)  =  min  Jd(r )  =  min  Ew[Ld(r,u>)\  (10) 

In  general,  this  is  a  notoriously  hard  stochastic  integer  programming  problem.  Even  in  a 
deterministic  setting,  where  we  may  set  Jd(r)  =  Ld(r,u),  this  class  of  problems  is  NP-hard. 
When  the  system  operates  in  a  stochastic  environment  (e.g.,  in  a  resource  allocation  setting 
users  request  resources  at  random  time  instants  or  hold  a  particular  resource  for  a  random 
period  of  time)  and  no  closed-form  expression  for  Ew[Ld(r,u)\  is  available,  the  problem  is 
further  complicated  by  the  need  to  estimate  Ew[Ld(r,w)\.  This  generally  requires  Monte 
Carlo  simulation  or  direct  measurements  made  on  the  actual  system,  approaches  which  are 
generally  computationally  intensive  and  slow  for  the  purpose  of  rapid  decision-making. 

With  this  motivation,  we  have  worked  toward  developing  resource  allocation  algorithms 
that  can  take  explicit  advantage  of  Concurrent  Simulation  methods  such  as  the  TWA  dis¬ 
cussed  in  the  previous  section.  We  outline  below  our  basic  approach  and  include  in  the 
Appendix  a  paper  [16]  that  provides  technical  details  and  sample  numerical  results.  The  key 
idea  is  to  transform  the  original  discrete  set  Ad  into  a  continuous  set  over  which  a  “surrogate” 
optimization  problem  is  defined  and  subsequently  solved.  At  every  step  of  the  continuous 
optimization  process,  the  continuous  state  obtained  is  mapped  back  into  a  feasible  discrete 
state;  based  on  a  realization  under  this  feasible  state,  new  sensitivity  estimates  are  obtained 
that  drive  the  surrogate  problem  to  yield  the  next  continuous  state.  The  proposed  scheme, 
therefore,  involves  an  interplay  of  sensitivity-driven  iterations  and  continuous-to-discrete 
state  transformations.  The  key  issue  then  is  to  show  that  when  (and  if)  an  optimal  alloca¬ 
tion  is  obtained  in  the  continuous  state  space,  the  transformed  discrete  state  is  in  fact  r*  in 
(10). 

Clearly,  he  ability  to  obtain  sensitivity  estimates  with  respect  to  discrete  decision  vari¬ 
ables  is  a  critical  component  of  this  approach.  Concurrent  Simulation  methods  are  ideally 
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suited  to  meet  this  objective  and  provide  the  synergy  required  between  real-time  decision 
making  approaches  and  the  need  for  rapid  simulation-based  information  to  support  these 
approaches.  The  Appendix  provides  additional  material  that  details  this  approach  as  an 
application  of  Concurrent  Simulation. 

3  STOCHASTIC  FIDELITY  IN  MULTI-RESOLUTION 
SIMULATION  MODELS 

In  modeling  complex  systems  through  simulation,  such  as  theater-level  combat  or  large 
scale  high  speed  communication  networks,  it  is  impossible  to  mimic  every  detail.  The  most 
common  approach  is  to  hierarhically  decompose  the  whole  system  into  modules  with  different 
simulation  resolutions.  The  low  resolution  module  consists  of  some  coarse-grained  equations, 
which  are  used  to  model  the  large  scale,  low  resolution  behavior.  The  coefficients  of  these 
“low  resolution”  equations  are  derived  from  the  high  resolution  module,  which  executes 
some  detailed,  smaller  scale  simulations.  The  interfaces  between  these  modules  are  critical. 
Common  practice  is  to  use  the  overall  average  of  the  high  resolution  results  as  the  input 
to  the  low  resolution  simulator.  This  neglects  the  facts  that  (a)  the  high  resolution  results 
could  have  dramatically  different  qualitative  features  and  should  not  be  just  averaged 
quantitatively ,  and  (b)  even  when  the  high  resolution  simulation  paths  are  qualitatively 
similar,  the  variance  of  the  paths  still  needs  to  be  accounted  for  in  the  low  resolution  module. 

To  address  (a)  our  approach  is  to  cluster  the  high  resolution  combat  simulation  sample 
paths  according  to  their  features  and  use  the  averages  over  each  cluster  as  inputs  for  the  low 
resolution  module.  To  address  (b)  our  approach  is  to  derive  new  low  resolution  dynamics 
that  would  account  for  the  higher  moments  of  the  high  resolution  simulation  paths.  In  the 
following,  we  first  discuss  the  issue  of  clustering  large  dimensional  data. 
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Clustering.  The  problem  of  systematically  clustering  data  has  drawn  considerable  re¬ 
search  attention.  Later  in  this  section,  we  give  a  brief  review  of  some  classical  clustering 
algorithms.  These  algorithms  are  too  time-consuming  when  the  data  vector  dimensions  are 
high.  In  this  project,  we  have  considered  the  neural  network  (NN)  technology,  specifically 
a  NN  type  known  as  ART  NN,  where  ART  stands  for  Adaptive  Resonance  Theory ,  as  de¬ 
veloped  by  Carpenter  and  Grossberg  [3].  The  key  component  in  the  ART  NN  design  is  a 
feedback  mechanism  that  stabilizes  the  learning  process.  The  input  data  is  mixed  with  the 
trial  feature  for  competitive  learning,  forming  a  feedback  control  loop  where  the  input  data 
is  the  external  input,  while  the  trial  feature  is  the  output  which  is  fed  back  to  mix  with  the 
input  to  stabilize  the  system.  The  key  for  success  is  of  course  the  tuning  of  the  feedback 
gain!  This  is  motivated  by  human  learning  processes:  enforcing  already  learned  prototypes 
on  the  new  data.  It  turns  out  that  this  is  quite  successful  in  ART  clustering.  Some  numerical 
results  are  reported  in  [17]  and  included  in  the  Appendix. 

The  ART  NN  is  designed  to  deal  with  a  classical  dilemma  in  data  clustering:  if  the 
requirement  for  distinct  high  resolution  paths  is  too  tight,  then  the  resulting  number  of 
clusters  will  be  too  large  to  provide  any  benefit;  if  the  requirement  is  too  loose,  then  the 
resulting  clusters  will  not  in  fact  contain  any  characteristic  features  of  the  high  resolution 
paths  they  encompass.  The  ART  NN  resolves  this  dilemma  in  this  way:  it  tries  to  learn  the 
similarities  between  high  resolution  paths,  and  create  “prototype”  clusters;  as  the  process 
going  on,  it  matches  further  data  with  existing  prototypes.  If  no  match  is  found,  then  a 
new  prototype  is  created.  Thus,  previously  learned  information  is  never  eroded  by  new 
knowledge. 

Deriving  low  resolution  dynamics.  We  now  discuss  the  issue  of  deriving  low  resolu¬ 
tion  dynamics,  in  which,  the  variability  of  the  high  resolution  simulation  paths  is  considered. 
Here  is  the  basic  idea:  given  a  set  of  low  resolution  differential  equations,  these  equations 
are  based  on  the  assumption  that  the  high  resolution  output  is  deterministic.  The  high 
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resolution  simulation  is  used  to  determine  the  coefficients  of  the  low  resolution  differential 
equations.  Since  the  actual  high  resolution  paths  are  random,  the  low  resolution  dynamics 
should  be  modified.  In  the  case  of  combat  simulation,  our  proposed  such  modification  is 
based  on  the  analysis  of  the  basic  Lan Chester  equation.  However,  the  principle  is  generally 
applicable  for  hierarchical  simulation  models. 

The  two  ideas  described  above  are  of  general  importance.  Combat  modeling  is  just 
a  typical  example  of  a  multi-time/space-scale  system.  Multiple-time-scale  phenomena  are 
encountered  in  numerous  fields,  including  computer  networks,  manufacturing  engineering, 
real-time  systems,  battlefield  (combat)  simulation,  and  physics.  For  example,  in  high-speed 
networks,  traffic  sources  operate  on  at  least  three  different  time  scales:  connections  for 
seconds  to  minutes,  bursts  within  a  connection  for  tens  to  hundreds  of  milliseconds,  packets 
within  a  burst  of  the  order  of  tens  to  hundreds  of  microseconds  per  packet.  These  present 
a  tremendous  challenge  to  the  performance  analyst.  For  instance,  to  study  algorithms  for 
establishing  network  connections  in  teleconferences  (scale  of  minutes),  one  must  capture  the 
essential  effects  on  those  connections  of  bursty  packets  (scale  of  milliseconds).  As  another 
example,  the  lifetime  of  real-time  processors  can  be  several  thousand  hours;  however,  when  a 
processor  fails,  fault  detection,  recovery,  and  the  consequent  switchover  to  another  processor 
must  take  place  in  a  few  milliseconds.  The  life  time  of  a  processor  has  a  quite  different  time- 
scale  from  the  failure  detection  and  recovery  process.  A  third  example,  from  control  theory, 
is  controlling  the  position  of  a  rotating  flexible  steel  shaft  pinned  at  one  end,  such  as  a  flexible 
robotic  manipulator.  The  slow  subsystem  would  correspond  to  the  motion  of  the  center  of 
mass  as  the  object  rotates,  while  the  fast  subsystem  would  correspond  to  the  vibrational 
motion  of  the  flexible  shaft  or  the  position  of  the  free  end  of  the  vibrating  shaft.  Simulating 
such  systems  using  traditional  methods  is  very  time-consuming,  since  the  simulator  has  to 
work  at  the  highest  level  of  time  resolution. 
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3.1  Hierarchical  Simulation 


A  general  hierarchical  simulation  scheme  is  summarized  in  Figure  4(a).  A  hierarchical  sim¬ 
ulation  consists  of  high-resolution,  and  low-resolution  (or  coarser),  modules.  The  high- 
resolution  module  is  the  usual  discrete-event  simulation,  while  the  lower-resolution  module 
consists  of  one  or  more  of  the  following  components:  differential  equations  (used  for  example 
in  combat  [23]  and  semiconductor  simulations  [19]),  standard  discrete-event  simulation,  and 
fluid  simulation  [25]  (currently  being  developed  by  our  research  group  and  showing  promise 
in  computer  network  simulation).  There  is  an  interface  between  the  two  modules  which 
forwards  some  statistics  of  the  high-resolution  module  output  (usually,  the  average)  to  the 
lower-resolution  module.  The  design  of  the  interface  is  often  critical  to  the  success  of  the 
entire  simulation.  Very  little  attention  has  been  paid  to  this  area:  it  is  one  of  the  main  focal 
points  of  our  research. 


Hierarchical  Simulation 


Low  Resolution  < 


Differential 
/ Equations 

^ Discrete-Event 
Simulation 


^  Fluid  Simulation 


High  Resolution 


Overall  State  Space 


(a)  Bask  Hierarchkal  Simulation  Scheme 


(b)  Use  of  Clustering 


Figure  4:  Hierarchical  Simulation  Structure 


Hierarchical  simulation  is  a  common  practice,  but  the  design  is  always  ad  hoc.  A  sys¬ 
tematic  design  and  analysis  framework  is  definitely  needed.  In  this  project,  we  developed 
some  fundamental  components  of  such  a  framework.  As  mentioned  above,  the  key  issue 
in  hierarchical  simulation  is  the  design  of  the  interface  between  the  hierarchies.  Common 
practice  is  to  use  the  average  as  the  statistics  to  be  passed  to  the  lower-resolution  module. 
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This  is  always  highly  unsatisfactory,  since  the  mean  often  obscures  even  the  most  important 
features  of  the  high  resolution  output.  In  this  part  of  the  work,  we  develop  methods  to  iden¬ 
tify  suitable  techniques  to  provide  higher-resolution  output  to  the  lower-resolution  module. 
Our  research  is  based  on  the  theory  of  random  perturbations  in  dynamic  systems  (see,  e.g. 
[12,  18])  and  the  clustering  of  large  vectors  using  neural  networks  (see,  e.g.  [3]). 

Impact  of  Higher  Moments.  To  show  why  higher  moments  of  the  high-resolution 
module  are  needed  as  inputs  to  the  lower-resolution  module,  let  us  consider  the  case  of 
hierarchical  combat  simulation.  This  is  an  important  application  area  and  its  structure  is 
typical  of  most  hierarchical  simulation  problems. 

In  the  following,  we  will  present  a  greatly  simplified  and  idealized  treatment  of  combat 
simulation.  Actual  combat  simulations  are  far  more  complex  (e.g.,  the  COSAGE  package 
has  27,000  lines  of  SIMSCRIPT);  however,  this  simplified  version  will  be  a  good  expository 
example. 

In  combat  simulation,  the  high-resolution  component  is  mostly  event-driven,  while  the 
lower-resolution  simulator  is  typically  a  generalization  of  the  well-known  Lanchester  attrition 
equation  [21].  This  consists  of  a  set  of  linear  differential  equations,  stating  that  the  attrition 
of  one  side  is  proportional  to  the  strength  of  the  other  side.  Incidentally,  the  Lanchester 
equations  are  also  used  in  economics  to  model  competition  between  large  corporations. 

The  basic  Lanchester  equations  are 

6(f)  =  -0r(t),  r(t)  =  -cb(i)  (11) 

where  6(f),  r(t)  denote  the  respective  strengths  at  time  t  of  the  two  sides  engaged  in  combat. 
Given  the  initial  conditions,  6(0)  =  60,  r(0)  =  r0  and  information  on  the  coefficients  0,c, 
these  equations  can  be  solved. 

The  coefficients,  0,  c,  are  estimated  using  a  high-resolution  discrete-event  simulator.  By 
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the  very  nature  of  their  creation,  these  estimates  have  an  uncertainty  associated  with  them. 
It  is  important  to  study  the  effect  of  this  uncertainty  on  the  behavior  of  the  Lanchester 
equations  and  ultimately  its  output,  which  is  the  final  output  of  the  entire  combat  simulator. 

In  the  following  discussion,  purely  for  brevity  of  the  description,  we  will  assume  that  c 
is  known  perfectly,  while  9  is  a  random  variable,  in  fact,  however,  both  9  and  c  are  random 

[14]- 

Assume  9  is  random,  with  E[8]  =  /i,  Var[0]  =  a2.  To  understand  the  impact  of  the 
uncertainty  (randomness)  in  9 ,  we  proceed  as  follows.  Let  a  be  a  sample  value  of  9.  Then, 
we  can  conveniently  treat  b  and  r  as  functions  of  both  a  and  t ;  that  is,  write  them  as 
b(a,t),r(a,t).  Let  b(a,t),  r(a,t)  satisfy 

jjb{a,t)  =  -ar(a,t),  ^r(a,t)  = -cb(a,t)  (12) 

such  that  6(a,0)  =  60,  r(a,0)  =  r0. 

^From  these  equations  we  can  easily  derive  differential  equations  for  db(a,t)/da  and 
dr(a,t)/da  and  the  second  derivatives.  These  derivatives  can  determine  the  second  order 
statistics  of  the  basic  quantities  b(9,t),r(9,t),  which  would  be  lost  in  the  current  “mean 
value”  practice.  These  derivatives  will  also  provide  some  refined  approximations  for  E[b(9,  i)] 
and  i?[r(0,t)].  In  particular,  if  a  is  small  we  would  have 

£[&((!,()]  »  +  (13) 

E{r(0,t ))  «  rM  +  (14) 

Some  immediate  observations  are: 

•  The  traditional  method  of  using  the  mean  of  the  high-resolution  simulation  results  as 
the  coefficients  of  the  Lanchester  equation  could  cause  serious  bias  due  to  the  variance. 
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•  Such  a  bias  can  be  significantly  reduced  by  using  high-order  statistics. 

The  above  has  been  an  idealized  treatment;  in  actuality,  the  attrition  equations  are  far  more 
complicated.  However,  the  principle  outlined  here  is  clearly  generic  and  holds  for  more  com¬ 
plicated  cases  as  well.  Actually,  the  principal  idea  is  generally  applicable  to  hierarchical 
simulation  in  many  other  areas  such  as  the  modeling  of  manufacturing  systems,  semiconduc¬ 
tor  device  [19],  and  telecommunication  networks. 

We  now  describe  an  application  in  a  hierarchical  combat  simulation  model  we  have  worked 
on.  This  model,  called  “Concept  Evaluation  Model  (CEM)”,  has  been  used  by  the  U.S.  Army 
Concept  Analysis  Agency.  The  high  resolution  module  in  this  model  is  called  “COmbat 
SAmple  GEnerator  (COSAGE)”  and  it  generates  battle  paths  at  division  or  lower  levels. 
The  high  resolution  module  contains  a  set  of  attrition  equations  based  on  the  Lanchester 
principle  and  is  called  “ATtition  CALculation  (ATCAL)”.  The  attrition  equations  for  the 
direct  fire/point  fire  case  are: 

(A Nk)ij  =  Wi(RATE)ijPijk[l  -  (1  -  Aijk)Nk]  lit1  ~  Aljk'}Nk. 

k1 

ANk  =  (1  -  e~ANk,77k)Nk 

where: 

•  ( ANk)ij  is  attrition  of  the  kth  vehicle  by  the  jth  weapon  of  the  ith  vehicle, 

•  iV;,  ~Nk  are  the  average  number  of  the  ith  (red)  and  the  kth.  (blue)  vehicles, 

•  (RATE)ij  denotes  the  rounds  that  the  jth  weapon  of  the  ith  vehicle  can  fire  during 
an  engagement, 

•  Pijk  is  the  probability  of  kill  per  round, 
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•  Aijk  is  the  target  availability  -  the  fraction  of  time  that  the  fcth  target  is  available  for 
the  jth  weapon  of  the  ith  vehicle, 

•  k'  is  the  index  of  vehicles  that  have  higher  priority  than  k. 

To  implement  the  perturbation  analysis  described  above  we  need  to  calculate  derivatives 
such  as  d(ANk)ij)/ dPijk  from  the  above  equations.  The  implementation  is  straightforward. 

Clustering.  As  briefly  discussed  before,  another  important  issue  in  presenting  the  results 
of  the  high-resolution  module  to  the  low-resolution  module  is  that  of  aggregating  the  high- 
resolution  results. 

Quite  often,  the  system  being  simulated  is  such  that  the  high-resolution  simulator  pro¬ 
duces  so  widely  divergent  outputs  that  it  does  not  make  sense  to  summarize  its  output  over 
the  entire  sample  space.  In  such  cases,  we  must  subdivide  the  sample  space  into  segments, 
and  get  the  high-resolution  simulator  to  produce  an  appropriate  input  to  the  low-resolution 
simulator  for  each  such  segment.  We  have  dealt,  in  the  previous  sections,  with  the  issue 
of  how  to  generate  appropriate  statistics  for  each  segment.  Here,  we  will  consider  how  to 
carry  out  the  subdivision  of  the  sample  space.  Essentially,  the  low-resolution  simulation  will 
be  broken  down  into  a  number  of  distinct  simulations,  one  for  each  segment  of  the  sample 
space,  as  depicted  in  Figure  1(b). 

To  carry  out  such  a  segmentation,  the  high-resolution  paths  need  first  to  be  grouped 
by  their  common  features.  These  features  then  determine  the  corresponding  low-resolution 
model.  Each  high-resolution  output  group  then  feeds  a  corresponding  low-resolution  simu¬ 
lator. 

To  group  the  high-resolution  sample  paths,  we  need  to  perform  clustering  analysis  for 
usually  huge  dimensional  data.  The  exact  clustering  is  very  time-consuming.  The  practice 
of  classifying  objects  according  to  perceived  similarities  is  the  basis  for  much  of  science  and 
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engineering.  Organizing  data  into  sensible  groupings  is  one  of  the  most  fundamental  modes 
of  understanding  and  learning.  Clustering  methods  have  been  widely  applied  in  pattern 
recognition,  image  processing,  and  artificial  intelligence.  In  this  project,  we  dealt  with 
clustering  methods  for  the  preservation  of  statistics  in  hierarchical  simulation. 

A  large  collection  of  clustering  algorithms  is  available  in  a  variety  of  scientific  disciplines 
and  new  clustering  programs  continue  to  appear  in  the  scientific  literature.  Our  focus  has 
been  on  Adaptive  Resonance  Theory  (ART)  to  cluster  high-dimensional  data  vectors.  Ad¬ 
vantages  of  this  approach  include  its  computational  efficiency,  as  well  as  allowing  the  user 
to  easily  control  the  degree  of  similarity  of  patterns  placed  on  the  same  cluster.  Our  experi¬ 
mental  results  corroborate  these  observations. 

ART  neural  networks  were  developed  by  Carpenter  and  Grossberg  [3]  to  understand  the 
clustering  function  of  the  human  visual  system.  They  are  based  on  a  competitive  learning 
scheme  and  are  designed  to  deal  with  the  stability/plasticity  dilemma  in  clustering  and 
general  learning.  It  is  clear  that  too  much  stability  would  lead  to  a  stubborn  mind,  while 
too  much  plasticity  would  lead  to  unstable  learning.  ART  neural  networks  successfully 
resolve  this  dilemma  by  matching  the  input  pattern  with  the  prototypes.  If  the  matching 
is  not  adequate,  a  new  prototype  is  created.  In  this  way,  previously  learned  memories  are 
not  eroded  by  new  learning.  In  addition,  the  ART  neural  network  implements  a  feedback 
mechanism  during  learning  to  enhance  stability. 

Our  experiments  of  using  ART  neural  networks  with  combat  simulation  paths  have  been 
quite  successful  [6].  We  believe  further  improvement  with  the  ART  structure  can  lead  to 
a  fundamental  breakthrough  in  large  data  clustering,  which  is  needed  in  complex  systems 
modeling.  A  neural  network  could  be  developed  into  a  generic  numerical  clustering  tool  for 
many  important  problems  in  intelligent  data  analysis. 

It  is  worth  mentioning  that  during  this  project  we  located  an  alternative  non-parametric 
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clustering  method  recently  been  introduced  by  Domany  and  co-workers  [2]  which  holds 
promise  for  application  to  large  databases.  The  idea  is  to  embed  the  data  in  a  spin  model 
with  each  data  point  represented  by  a  single  spin.  The  distance  between  nearby  points 
determines  the  strength  of  a  ferromagnetic  coupling  between  the  spins.  A  procedure  akin 
to  simulated  annealing  is  then  applied  to  this  spin  system.  Using  standard  Monte  Carlo 
methods  for  equilibrating  spin  systems  at  some  temperature,  the  temperature  is  lowered  to 
a  range  where  data  points  with  sufficient  similarity  are  clustered.  By  adjusting  the  temper¬ 
ature  in  the  “superparamagnetic”  regime,  coarse  or  fine-grained  clustering  can  be  achieved. 
It  remains  an  open  issue  to  identify  the  possible  application  of  this  scheme  to  general  data 
clustering  and  specific  tasks  involved  in  combat  simulation. 

An  Application  to  a  “real-world”  complex  system.  During  the  course  of  the 
project,  we  encountered  an  interesting  opportunity  to  test  the  clustering  techniques  we  de¬ 
veloped  in  the  case  of  a  complex  manufacturing  system.  In  particular,  in  working  with  a 
large  metal  manufacturer  we  were  faced  with  the  issue  of  supplying  a  low-resolution  model 
of  a  large  plant  with  the  necessary  parameters  for  running  it,  much  like  the  coefficients  of 
the  Lanchester  equation  in  (11).  These  parameters  are  to  be  obtained  from  detailed  (high- 
resolution)  models  of  the  process  plans  (or  flowpaths)  for  over  10,000  products  manufactured 
in  the  plant.  A  flowpath  is  a  specific  sequence  of  Production  Centers  (PCs)  with  different 
processing  characteristics  at  each  PC  (there  are  over  100  such  PCs).  Thus,  each  flowpath 
may  be  thought  of  as  corresponding  to  a  unique  product;  however,  since  the  low-resolution 
model  cannot  possibly  handle  input  data  for  over  10,000  flowpaths,  the  objective  is  to  group 
products  with  similar  flowpaths.  For  purposes  such  as  forecasting,  capacity  planning,  and 
lead-time  estimation  (among  others)  it  is  in  fact  indispensable  to  have  such  product  groups 
available:  not  only  it  is  conceptually  infeasible  to  work  with  over  10,000  distinct  products,  it 
is  also  practically  impossible  to  input  such  high-dimensional  data  for  over  10,000  products 
and  100  PCs  into  modeling  and  decision  support  tools.  Moreover,  even  if  there  were  an 
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automated  way  to  accomplish  this,  it  would  be  unrealistic  to  expect  anyone  to  manipulate 
or  interpret  output  data  with  information  such  as  inventory  levels  and  lead  times  for  many 
thousands  of  distinct  products. 

In  the  effort  to  establish  groups  (or  clusters)  of  products  based  on  similarities  in  flowpaths 
and  processing  characteristics,  an  initial  project  was  set  up  with  plant  experts  given  the  task 
to  “manually”  create  such  groupings.  The  project  was  quickly  abandoned:  in  addition  to 
the  sheer  product  volume  which  makes  this  task  prohibitive,  it  is  also  difficult  to  rationally 
quantify  “similarities”  in  flowpaths  and  processing  data  without  some  systematic  means  of 
doing  so.  We  were  able  to  accomplish  this  task  using  the  clustering  techniques  we  have 
developed  and  obtained  a  “compression”  of  over  10,000  products  to  25-100  product  clusters 
(depending  on  the  aggregation  accuracy  required,  which  is  completely  controlled  by  the 
analyst).  Of  particular  interest  is  the  fact  that  the  plant  experts  who  reviewed  the  results 
we  obtained  found  “by  hindsight”  the  clusters  defined  by  our  method  consistent  with  their 
expectations. 

As  mentioned  above,  it  should  be  possible  to  evaluate  the  result  of  a  particular  grouping 
(or  clustering)  of  flowpaths  and  to  compare  it  to  alternative  groupings  so  as  to  determine  the 
appropriate  level  of  flowpath  aggregation  desired  depending  on  the  application  of  interest. 
For  some  tasks,  less  than  100  groups  may  be  amply  adequate,  while  for  others  it  may  be 
necessary  to  use  a  form  of  grouping  based  on  tighter  similarity  requirements  that  would  yield 
several  hundred  flowpath  clusters  aggregated  into  product  groups.  In  general,  the  “tighter” 
the  similarity  requirements  imposed,  the  larger  the  number  of  resulting  clusters  is  likely 
to  be.  This  capability  is  an  integral  part  of  the  clustering  tools  we  are  developing  and  is 
captured  by  the  so-called  “vigilance  parameter”. 
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3.2  Basic  Concepts  of  Clustering 

In  this  section,  we  report  our  review  of  some  basic  concepts  in  data  clustering  techniques. 
This  review  is  important  because  it  not  only  give  the  background  information  for  our  work, 
it  also  motivates  ideas  and  concepts  for  possible  new  techniques. 

3.2.1  Classification  and  Clustering 

Classification  is  the  actual  or  ideal  arrangement  of  patterns  which  are  alike,  and  the  separa¬ 
tion  of  those  which  are  not;  the  purpose  of  the  arrangement  is  to  shape  and  keep  knowledge, 
to  analyze  the  structure  of  phenomena,  and  to  relate  different  aspects  of  a  phenomenon. 
Applications  of  classification  in  science  include  Library  Science  and  Information  Retrieval, 
Mathematics,  Biology,  Physics  and  Chemistry,  Social  and  Political  Science. 

Clustering  is  the  mathematical  technique  designed  to  reveal  classification  structures  in 
data  collected  from  real-world  phenomena;  the  purpose  of  clustering  is  to  (a)  analyze  the 
structure  of  the  data,  (b)  relate  different  aspects  of  the  data  to  each  other,  and  (c)  assist  in 
classification  design. 

A  typical  problem  is  of  the  following  form:  Given  a  set  of  entities,  determine  its  subsets 
(called  clusters ),  which  are  homogeneous  and  well  separated.  Homogeneity  means  that 
entities  in  the  same  cluster  should  resemble  each  other;  separation  means  that  entities  in 
different  clusters  should  not. 

3.2.2  Clustering  Framework 

The  basic  framework  of  clustering  consists  of  eight  elements: 

1.  Sampling:  Select  a  set  O  =  {0%,  O2, ...,  On}  of  N  entites. 
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2.  Data  collection ::  Observe  or  measure  p  characteristics  on  each  of  the  entities  of  0. 
This  leads  to  a  data  matrix  X  of  size  N  x  p  with  binary  entries  for  observations  and 
real  entries  for  measurements. 

3.  Dissimilarity:  Construct  a  matrix  D  =  ( dki )  of  size  N  x  N,  which  is  generated  from 
data  matrix  X.  Dissimilarities  satisfy  the  following  properties: 

•  Symmetry:  dki  =  d 

•  Non-negativity:  dki  >  0 

•  Vanishing  diagonal  elements:  dkk  =  0,  for  l,k  —  1,2, 

The  Euclidean  distance  on  a  Euclidean  plane  is  an  example. 

4.  Types  of  clustering: 

We  list  some  of  the  most  important  clustering. 

1)  Subpartition  SPm  —  {Ci, C2,  ..., Cm}  of  O  with  M  clusters:^-  C  0,Cj  ^  0, Ci  fi 
Cj  =  0  for  i,j  =  1,2, ...,  M. 

Note  that  here  the  clusters  are  not  required  to  cover  the  entire  set  O; 

2)  Partition  Pm  =  { Ci ,  C2, ...,  Cm}  of  O  into  M  clusters:  Cj  7^  0,  CiOCj  =  0,  U jL\Cj  = 
O;  for  i,j  =  1,2, ...,  M. 

Note  that  here  the  clusters  are  required  to  cover  the  entire  set  O; 

3)  Covering  COm  =  {Ci,  C2, ...,  Cm}  of  0  by  M  clusters:  Cj  7^  0,U Y=\Cj  =  0;  for 
i  =  1,2,...,  Af; 

4)  Hierarchy  HSp  =  {S Pu  S P2l S PK}  of  subpartition  of  O:  Set  of  I<  subpartitions 
S Pi,  SP2,  •••,  SPk  of  0  such  that  C,  G  SPk ,  Cj  G  SPi  and  k  >  l  Cj  C  CiorCi  D  Cj 

0. 

Note  that  here  the  clusters  are  not  required  to  cover  the  entire  set  0; 
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5)  Hierarchy  H  =  {Pi,  P2,  •  ••,  P/v)  of  partition  of  0:  Set  of  N  partitions  Pi,  P2, Pw 
of  0  such  that  Ci  €  P/c,  Cj  €  Pi  and  k  >  l  =>  C{  C  Cj  or  C,  fl  Cj  =  0. 

Note  that  here  the  clusters  are  required  to  cover  the  entire  set  0. 

5.  Criterion:  Select  a  criterion  to  evaluate  the  clusterings  of  the  type  decided  upon  in 
Step  4.  Such  a  criterion  may  be  of  one  of  the  following  types: 

1)  Threshold- type  criteria,  in  which  a  single  dissimilarity  determines  this  value. 

2)  Sum-type  criteria,  in  which  a  sum  of  dissimilarities  involving  one  entity  determines 
this  value. 

3)  Sum-sum-type  criteria,  in  which  all  dissimilarities  between  pairs  of  entities  of  the 
cluster  are  used  to  determine  this  value. 

6.  Algorithm:  Choose  or  devise  an  algorithm  for  the  problem  defined  in  Steps  4  and  5. 

7.  Computation:  Determine  the  clusterings  of  O  which  optimize  the  chosen  criterion,  with 
the  algorithm  of  Step  6. 

8.  Interpretation  of  results;  Try  to  use  descriptive  statistics  to  summarize  the  character¬ 
istics  of  each  cluster. 

We  now  turn  our  attention  to  the  description  of  a  class  of  concrete  clustering  algorithms 
referred  to  as  “sequential  clustering.” 

3.2.3  Sequential  Clustering 

The  principle  of  sequential  clustering  is  as  follows.  Most  commonly  used  paradigms  in  cluster 
analysis,  such  as  hierarchical  clustering  and  partitioning,  imply  that  all  entities  should  be 
assigned  to  clusters.  However,  in  most  cases,  this  is  unnatural:  some  entities  fit  poorly  into 
any  clusters,  or  they  are  just  noise.  Thus,  sequential  clustering  searches  within  the  data 
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(whatever  structure  there  is  and  nothing  more),  isolates  clearly  apparent  clusters  one  at  a 
time,  and  stops  when  this  cannot  be  done  anymore.  A  similar  procedure  is  often  used  in 
image  processing  where  objects  are  recognized  one  after  another. 

Following  the  basic  framework  of  the  previous  section,  the  sequential  clustering  scheme 
works  as  described  next. 

Step  1:  Sample.  Select  entities  among  which  clusters  are  to  be  found. 

Step  2:  Data.  Measure  characteristics  of  the  entities  and  obtain  the  sample  data  matrix  X. 

Step  3:  Dissimilarities.  Based  on  the  sample  data  matrix  X  , compute  dissimilarities  between 
pairs  of  entities,  and  obtain  the  dissimilarities  matrix  D  =  (d,j). 

Step  4:  Criterion.  Choose  a  criterion  to  evaluate  the  homogeneity  and  separation  of  the 
clusters  to  be  obtained. 

Step  5:  Choosing  I<  entities  among  N.  According  to  the  criterion  of  Step  4,  determine  a  best 
subset  of  I<  entities  of  O  as  an  optimal  cluster  (7* (usually  I<  is  considered  as  a  parameter). 

Step  6:  Significance  test.  Apply  formal  or  informal  tests  to  evaluate  if  the  cluster  C*  found 
in  Step  5  corresponds  to  some  part  of  the  structure  inherent  to  0,  or  only  to  noise.  In  the 
former  case,  record  the  list  of  entities  of  C*,  remove  them  from  0  and  return  to  Step  5;  in 
the  latter  case  proceed  to  Step  7. 

Step  7:  Interpretation  of  results.  Describe  the  clusters  found  sequentially  by  the  lists  of  their 
entities  and  various  techniques  of  descriptive  statistics.  Proceed  to  a  substantive  interpreta¬ 
tion  of  these  results. 

To  complete  a  seqential  algorithm,  one  must  specify  how  to  choose  K  entities  from 
N  entities  and  which  significance  test  applies.  It  is  also  important  to  choose  the  most 
appropriate  clustering  criterion  for  the  specific  application.  In  Figures  5,  6  and  7  we  illustrate 
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three  different  criteria  in  two-dimensional  space.  These  concepts  and  related  issues  will  be 
further  explored  in  future  research. 


Figure  5:  Sequential  clustering  with  minimum  diameter  criterion. 

In  summary,  clustering  is  a  statistical  tool.  It  aims  at  extracting  a  possible  cluster 
structure  from  a  large  data  set.  Based  on  the  selection  of  clustering  type,  criterion  for 
clusters  and  significance  test,  there  are  many  kinds  of  clustering  algorithms.  The  choice  of 
the  algorithm  depends  on  the  specific  problem  and  on  computational  capacity.  Traditional 
clustering  algorithms  are  tantamount  to  global  optimization  for  a  selected  objective  function 
(criterion).  They  often  imply  huge  computational  complexity  and  are  NP-hard.  As  the  need 
for  clustering  is  increasing,  especially  for  large-dimension  data  sets,  more  efficient  approaches 
are  required.  Some  new  approaches,  such  as  those  based  on  neural  networks  and  probabilistic 
methods,  have  emerged  and  have  shown  to  be  efficient  for  some  problems.  In  the  next  section 
we  review  the  basic  principles  of  an  approach  based  on  Adaptive  Resonance  Theory  (ART) 
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Figure  6:  Sequential  clustering  with  minimum  radius  criterion 
neural  networks,  and  present  concrete  algorithms  we  have  used  in  this  project. 

3.2.4  Clustering  using  Adaptive  Resonance  Theory  (ART) 

As  we  mentioned  in  the  Introduction,  this  part  of  our  work  is  motivated  by  our  Path  Bundle 
Grouping  approach  in  hierarchical  combat  simulation.  In  dealing  with  hierarchical  simulation 
models  one  needs  to  consider  grouping  the  sample  paths  generated  from  the  high  resolution 
simulators  so  as  to  provide  appropriate  input  statistics  to  the  lower  resolution  simulator. 
This  requires  clustering  very  high  dimensional  data  vectors  (the  sample  paths  from  the  high 
resolution  simulator).  Classical  clustering  algorithms  are  not  efficient  for  this  purpose.  We 
have  used  this  approach  in  the  Concept  Evaluation  Model  (CEM)  of  the  Concept  Analy¬ 
sis  Agency  in  order  to  group  the  sample  paths  from  the  high  resolution  Combat  Sample 


37 


Figure  7:  Sequential  clustering  with  maximum  split. 

Generator  (COSAGE)  and  generate  the  input  to  the  lower  resolution  Attrition  Calculation 
(ATCAL).  Concrete  numerical  results  are  reported  in  [17]. 

The  common  algorithm  used  for  clustering  in  the  ART  framework  is  closely  related  to  the 
well-known  /c-means  clustering  algorithm.  Both  use  single  prototypes  to  internally  represent 
and  dynamically  adapt  clusters.  The  fc-means  algorithm  clusters  a  given  set  of  input  patterns 
into  k  groups.  The  parameter  k  thus  specifies  the  coarseness  of  the  partition.  In  contrast, 
ART  uses  a  minimum  required  similarity  between  patterns  that  are  grouped  within  one 
cluster.  The  resulting  number  k  of  clusters  then  depends  on  the  distances  (in  terms  of  the 
applied  metric)  between  all  input  patterns,  presented  to  the  network  during  training  cycles. 
This  similarity  parameter  is  called  vigilance  and  is  denoted  by  p. 

The  first  step  in  the  ART  algorithm  is  the  preprocessing  stage.  It  is  the  creation  of  an 
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input  pattern  as  an  array  with  a  constant  number  of  m  elements.  ART  requires  the  same 
pattern  size  for  all  patterns,  i.e.,  the  dimension  of  the  input  space  into  which  all  cluster 
regions  shall  be  placed.  Any  of  the  already  formed  prototypes  is  of  the  same  dimension  m. 
In  addition,  the  elements  of  an  input  pattern  must  fit  constraints  concerning,  for  example, 
value  bounds  or  the  geometric  length  of  the  array  viewed  as  a  vector.  These  constraints 
are  characteristics  of  the  different  types  of  ART  networks  and  are  needed  to  make  the  input 
comparable  to  the  cluster  prototypes.  Once  the  input  pattern  is  formed,  it  is  compared  to 
the  n  stored  prototypes  in  a  search  stage.  If  the  degree  of  similarity  between  the  current 
input  pattern  and  the  best  fitting  prototype  J  is  at  least  as  high  as  a  given  vigilance  p,  this 
prototype  is  chosen  to  represent  the  cluster  containing  the  input.  The  degree  of  similarity 
is  typically  limited  to  the  range  [0,1].  If  the  similarity  between  input  pattern  and  best 
fitting  prototype  does  not  fit  into  the  vigilance  interval  [p,  1],  then  a  new  cluster  has  to  be 
installed,  where  the  current  input  is  most  commonly  used  as  the  first  prototype  or  “cluster 
center” .  Otherwise,  if  one  of  the  previously  committed  clusters  matches  the  input  pattern 
well  enough,  it  is  adapted  by  slightly  shifting  the  prototype’s  values  toward  the  values  of  the 
input  array. 

The  primary  processing  module  of  the  ART  network  is  a  competitive  learning  net¬ 
work,  as  shown  in  Figure  8.  The  m  neurons  of  an  input  layer  F\  register  the  values 
of  an  input  pattern  I  =  (*i,*V*  •>*»)•  Every  neuron  of  an  output  layer  F2  receives 
a  bottom-up  net  activity  tj ,  built  from  all  Fi-outputs  S  =  I.  The  vector  elements  of 
T  =  (ti,f2,"  •  ^n)can  be  seen  as  the  result  of  comparisons  between  input  pattern  I  and 
prototypes  Wx  =  (wiu  ■  ■  ■  ,wlm)t. .  .,Wn  =  {wnl,- ■  ■  ,wnm)  These  prototypes  are  stored  in 
the  synaptic  weights  of  the  connections  between  Fi-  andF2-  neurons.  Only  an  F2-  neuron 
J ,  receiving  the  highest  net  activity  tj,  sets  its  output  to  1,  while  all  other  output  neurons 

remain  0: 

f  1  if  tj  >  ma x(tk  :  k  ±  j)  (15\ 

—  |  0  otherwise 
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One  possible  way  to  compute  net  activities  tj,  and  by  that  measure  the  similarity  between  I 
and  Wj,  is  the  weighted  sum 

m 

tj  =  Wij  (16) 

i=  1 


Variations  on  this  measure  are  often  employed  because  the  value  tj  exerts  great  influence 
on  the  resulting  cluster.  After  an  ^-winner  J  has  been  found,  the  corresponding  prototype 
Wj  =  (wu,  •  •  • ,  wmj)  is  adapted  to  the  input  pattern  I.  One  suitable  method  for  adaptation 
is  to  move  Wj  slightly  toward  input  pattern  I  as  follows: 

vrrw)  =  ,./  +  (!  (17) 

where  the  constant  learning  rate  rj  €  [0,  1]  is  chosen  to  prevent  prototype  Wj  from  moving  too 
fast  and  therefore  destabilizing  the  learning  process.  Prototypes  for  this  kind  of  competitive 
learning  network  can  be  initialized  either  with  random  values  or  with  values  of  randomly 
chosen  input  patterns  from  the  training  sequence. 

Competitive  learning  networks  of  this  kind  tend  toward  unstable  categorization  whenever 
the  distances  between  single  input  patterns  vary  in  too  wide  a  range.  Additionally,  there  is 
no  way  to  control  either  the  number  of  clusters  produced  by  the  network,  or  the  minimum 
similarity  of  patterns  in  one  cluster.  In  ART,  this  problem  is  solved  by  extending  the 
competitive  learning  network  as  shown  in  Figure  9.  A  second  set  of  connections  is  added, 
sending  the  F2-output  U  back  to  layer  F\.  The  synaptic  top-down  weights  Wij  of  these 
connections  are,  except  for  a  possible  scaling  factor,  identical  to  the  bottom-up  weights  Wij. 
The  top-down  net  activity  V  is  usually  calculated  by 

n 

vj  =  £  •  wa  (18) 

j=i 

This  leads  to 

V  =  U  •  Wa  =  Wj  (19) 
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Figure  8:  A  simplified  representation  of  the  competitive  learning  network 

because  all  F2-outputs,  except  tij,  are  set  to  0  [see  (1)].  So  input  layer  Fi  receives  prototype 
Wj,  representing  the  current  winning  cluster  J ,  as  net  activity.  Next,  the  most  complex 
part  of  signal  processing  in  ART  networks  takes  place,  i.e.,  matching  prototype  Wj  with 
input  pattern  I.  This  task  is  completed  in  ways  characteristic  to  the  different  types  of 
ART  networks  and,  uses  extensions  to  the  internal  structure  of  layer  F\.  This  yields  a 
single  matching  value  which  is  compared  to  the  vigilance  p,  defining  the  minimum  similarity 
between  an  input  pattern  and  the  prototype  of  the  cluster  it  is  associated  with.  If  the 
matching  value  is  smaller  than  vigilance  p,  the  current  winning  F2-neuron  is  removed  from 
the  competition  by  a  reset  signal.  The  reset  signal  forces  the  activation  F2-neuron  J  to  0  and 
another  F2-neuron  is  activated,  receiving  the  highest  net  activity  tj  of  all  non-reset  output 
neurons.  Once  a  prototype  is  found  that  leads  to  a  matching  value  with  input  pattern  J, 
at  least  as  high  as  vigilance  p,  no  further  reset  signal  is  applied  and  the  network  attains 
resonance.  The  position  of  the  last  winning  F2-neuron  indicates  the  final  cluster  for  input 
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Figure  9:  Basic  layout  of  an  ART  network 
/,  and  the  associated  prototype  is  adapted. 

Figure  10  demonstrates  the  similarity  concept  in  the  ART2  network.  We  emphasize  that 
the  coordinates  may  be  scaled  to  provide  rich  flexibility. 

The  initial  values  of  prototypes  that  have  not  yet  been  accessed  by  an  input  pattern, 
provide  for  two  key  features: 

1.  Previously  accessed  prototypes  are  first  compared  to  the  input  pattern  before  an  un¬ 
committed  prototype  is  chosen. 

2.  If  none  of  the  committed  clusters  matches  the  input  pattern  well  enough,  search  will 
end  with  recruitment  of  an  uncommitted  prototype. 

The  basic  structure  of  an  adaptive  resonance  neural  network  involves  three  groups  of 
neurons: 
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Figure  10:  Similarity  in  ART2  is  measured  by  the  angle 

•  FI  layer:  the  input  processing  field 

•  F2  layer:  the  cluster  units 

•  Reset  mechanism:  the  mechanism  to  control  the  degree  of  similarity  of  patterns  placed 
on  the  same  cluster. 

The  FI  layer  has  two  parts:  the  input  portion ,  denoted  by  Fl(a),  and  the  interface 
portion ,  denoted  by  Fl(b).  In  ART2,  some  processing  may  occur  in  the  input  portion.  The 
interface  portion  combines  signals  from  the  input  portion  and  the  F2  layer  to  compare  the 
similarity  of  the  input  signal  to  the  weight  vector  for  the  cluster  unit,  which  has  been  selected 
as  a  candidate  for  learning. 

To  control  the  similarity  of  patterns  placed  on  the  same  cluster,  there  are  two  sets  of 
connections  (each  with  its  own  weights)  between  each  unit  in  the  interface  portion  of  the 
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input  field  and  each  cluster  unit: 


•  The  Fl(b)  layer  is  connected  to  the  F2  layer  by  bottom-up  weights  6^;  each  is  the 
weight  on  the  connection  from  i-th  FI  unit  to  the  j-th  F2  unit. 

•  The  F2  layer  is  connected  to  the  Fl(b)  layer  by  top-down  weights  t,-,-;  each  tji  is  the 
weight  on  the  connection  from  j-th  F2  unit  to  the  i-th  FI  unit. 

The  F2  layer  is  a  competitive  layer:  the  cluster  unit  with  the  largest  net  input  becomes 
the  candidate  to  learn  the  input  pattern,  the  activations  of  all  other  F2  units  are  set  to  0. 
Then,  the  interface  units  combine  information  from  the  input  and  cluster  units. 

Whether  the  cluster  unit  is  allowed  to  learn  the  input  pattern  depends  on  how  similar 
its  top-down  weight  vector  is  to  the  input  vector ;  the  decision  is  made  by  the  reset  unit, 
based  on  the  signals  it  receives  from  the  input  and  interface  portions  of  the  FI  layer.  If 
the  cluster  unit  is  not  allowed  to  learn,  it  is  inhibited  and  a  new  cluster  unit  is  selected  as 
the  candidate.  A  graphical  representation  of  an  ART2  neural  network  structure  is  shown  in 
Figure  11,  where  the  operations  involved  are  listed  below: 
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Figure  11:  Structure  of  a  typical  ART2  neural  network 

4  METAMODELING  USING  NEURAL  NETWORKS 

The  purpose  of  “metamodeling”  is  to  capture  complex  input/output  relationships  embodied 
in  a  simulator  so  as  to  obtain  simulation  data  quickly  and  efficiently.  It  is  well  known 
that  multilayer  neural  networks  are  universal  function  approximators  [9].  As  such,  they  are 
often  used  for  non-parametric  modeling,  and  are  well  suited  to  the  task  of  metamodeling,  as 
detailed  in  our  previous  report  [5].  To  briefly  summarize  the  discussion  that  may  be  found 
in  Section  3  of  [5],  a  neural  network  is  a  device  (i.e.,  software)  which  conceptually  consists 
of  many  “neurons”  with  “weights”  attached  to  them.  By  adjusting  the  weights,  a  neural 
network  is  capable  of  generating  response  surfaces  (i.e.,  outputs)  of  extreme  generality.  The 
idea  of  using  “neurons”  comes  from  the  “stimulus-response”  paradigm  that  our  human  brain 
neurons  seem  to  conform  to.  The  metamodeling  procedure  we  have  pursued  consiusts  of  the 
following  basic  steps. 

First,  a  large-scale  simulation  with  possibly  hundreds  or  thousands  of  inputs  and  hundreds 
or  thousands  of  outputs  is  executed.  The  neural  network  is  a  device  that  we  have  separately 
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designed  (completely  independently  of  the  simulator)  which  is  fed  by  the  exact  same  inputs 
and  outputs  of  the  simulator;  it  does  not,  however,  intrude  in  any  way  into  it.  While  the 
simulator  is  running,  the  neural  network  observes  the  inputs  and  outputs  and  it  “learns” 
from  them.  This  is  called  “training”  the  network.  One  can  visualize  the  neural  network  as 
an  “entity”  which  is  highly  intelligent  but  has  no  knowledge  of  anything  initially  to  apply 
its  intelligence  to.  As  it  observes  the  simulation  unfold,  however,  it  learns  from  the  basic 
cause-effect  (i.e.,  input-output)  relationships  it  observes.  In  fact,  all  the  neural  network  does 
is  adjust  its  weights  so  as  to  emulate  the  behavior  it  observes  as  closely  as  possible. 

When  the  training  is  done,  the  simulator  can  be  taken  away.  The  neural  network  is  now 
the  surrogate  model:  we  may  give  it  some  inputs  (as  if  we  were  giving  them  to  the  simulator) 
and  it  immediately  gives  us  an  output  (as  the  simulator  would).  So,  we  can  think  of  it  as  a 
“function”  which  responds  to  any  input  by  providing  some  output,  except  that  there  is  no 
explicit  mathematical  expression  or  formula  -  just  a  device  (a  software  routine)  that  acts  as 
the  model. 

The  main  advantages  of  a  neural  network  were  discussed  in  [5].  It  was  also  pointed  out 
that  in  order  to  take  full  advantage  of  such  an  approach,  one  must  design  the  neural  network 
appropriately  and  develop  efficient  ways  to  accomplish  the  all-important  training  process. 
In  [5],  we  also  introduced  the  Cascade  Correlation  Neural  Network  (CCNN)  as  a  type  of 
multilayer  neural  network  that  builds  itself  while  it  learns  [10,  15,  20].  The  CCNN  starts 
small  and  makes  itself  larger  during  training  which  usually  leads  to  faster  learning  and  better 
performance. 

One  of  the  objectives  of  this  project  was  to  pursue  the  study  of  benchmark  problems  and 
evaluate  the  effectiveness  of  metamodeling  using  neural  networks.  In  our  earlier  work,  we 
concentrated  on  the  Tactical  Electronic  Reconnaissance  Simulation  (TERSM)  model  [22]. 
TERSM  has  been  extensively  studied  and  was  used  by  previous  researchers  to  develop  and 
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evaluate  polynomial  metamodels  [8,  27,  26]  that  we  were  able  to  compare  with  our  proposed 
CCNN  metamodeling  approach  extensively  in  [5]. 

While  TERSM  has  provided  an  excellent  testbed  to  study  the  feasibility  of  the  CCNN 
metamodeling  approach,  it  lacks  some  of  the  features  that  constitute  real  challenges  to 
metamodeling.  For  one,  TERSM  lacks  significant  randomness.  As  provided  to  us,  TERSM 
is  deterministic.  For  another,  TERSM  does  not  exhibit  asymptotic  relationships  which  are 
very  common  in  practice.  Exposing  such  asymptotic  behaviors  often  requires  very  long 
simulation  runs.  Avoiding  long  simulation  runs  is  one  of  the  real  benefits  of  metamodeling. 
Additionally,  asymptotic  behaviors  can  be  very  difficult  for  polynomials  to  capture,  and 
provide  one  of  the  primary  motivations  for  the  use  of  more  powerful  metamodeling  methods 
like  the  CCNN  we  are  investigating.  With  this  motivation  in  mind,  we  introduced  in  [5] 
a  new  testbed  system  we  called  the  Aircraft  Refueling  and  Maintenance  System  (ARMS). 
This  system  can  be  viewed  as  a  high-resolution  component  of  a  combat  simulation  model 
whose  output  is  used  by  a  lower-resolution  model.  In  Section  5.1,  we  review  the  ARMS 
model.  In  Section  5.2,  we  present  results  from  a  simulation  study  we  have  performed,  and 
in  Section  5.3,  we  include  the  results  from  our  CCNN  metamodeling  effort  applied  to  the 
ARMS  benchmark  problem. 

4.1  The  Aircraft  Refueling  and  Maintenance  System  (ARMS) 

The  basic  ARMS  model  is  shown  in  Figure  12.  As  illustrated,  ARMS  is  a  multiclass  queueing 
system.  Jobs  from  each  class  n  =  1, ...  ,N  arrive  with  average  rates  An  to  separate  arrival 
queues  with  capacities  Cn  (possibly  infinite).  The  system  has  6  tokens.  At  block  1,  the  jobs 
compete  for  tokens  on  a  priority  basis,  with  the  class  1  jobs  having  the  highest  priority.  The 
job  waiting  in  the  highest  priority  arrival  queue  will  be  the  first  to  get  a  token  when  one 
becomes  available.  After  getting  a  token,  the  jobs  enter  a  service  queue  with  capacity  Cs. 
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At  block  2  the  jobs  are  selected  from  the  service  queue  according  to  some  service  discipline 
(e.g.,  first  in  first  out  (FIFO))  and  routed  to  one  of  k  =  1, . . . ,  K  servers.  The  time  to  service 
a  job  is  a  random  variable  //(n,  k )  which  can  vary  as  a  function  of  the  job  class  n  and  the 
particular  server  k.  Upon  completing  service,  the  jobs  proceed  to  block  3,  where  the  token 
is  returned  to  a  token  pool,  and  the  job  leaves  the  system. 


Figure  12:  The  basic  ARMS  queueing  model. 

The  basic  ARMS  queueing  model  is  very  general  and  can  be  used  to  represent  a  large 
variety  of  Air  Force  C4I  operations.  For  example,  such  a  model  can  be  used  to  represent 
computer  networks,  communications  systems,  or  logistics  problems.  The  specific  problem  we 
will  consider  is  the  aircraft  refueling  and  maintenance  system  (ARMS)  shown  in  Figure  13. 
The  ARMS  problem  has  aircraft  requesting  to  land  at  a  particular  site  (airport  or  aircraft 
carrier)  for  refueling  and/or  maintenance  purposes.  Depending  on  aircraft  type,  a  priority  is 
assigned  to  each  aircraft  so  that  high-priority  ones  are  served  first.  Since  landing  capacity  and 
associated  maintenance  resources  are  limited,  a  specific  number  of  “permits”  (i.e.,  tokens)  are 
available.  An  aircraft  is,  therefore,  forced  to  wait  until  it  receives  a  permit.  Upon  receiving 
a  permit,  the  aircraft  is  guided  to  a  refueling/maintenance  area.  If  the  resources  required  to 
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complete  the  refueling/maintenance  process  are  not  immediately  available  (e.g.,  personnel, 
tools,  spare  parts,  fuel),  the  aircraft  is  further  delayed.  When  the  aircraft  completes  service, 
the  permit  is  returned  to  the  permit  pool,  and  the  aircraft  proceeds  to  take  off  and  return  to 
action.  In  studying  this  system,  one  is  interested  in  minimizing  the  expected  “down  time” 
of  an  aircraft,  with  more  emphasis  given  to  certain  types  of  aircraft  (the  ones  given  higher 
priority).  At  the  same  time,  one  is  interested  in  keeping  service  costs  within  acceptable  levels. 
From  a  modeling  standpoint,  one  must  therefore  determine  functional  relationships  such  as 
the  expected  down  time  of  a  priority  1  aircraft  with  respect  to  factors  such  as  the  number 
of  permits;  or  the  number  of  maintenance  resources  allocated  to  the  refueling/maintenance 

process. 


Figure  13:  The  aircraft  refueling  and  maintenance  system  (ARMS). 

A  preliminary  analysis  focused  on  the  3-class,  single  server  ARMS  model  shown  in  Fig¬ 
ure  14  was  reported  in  [5].  In  this  model,  the  class  1  jobs  have  the  highest  priority,  and 
the  class  3  jobs  the  lowest.  Job  arrivals  are  assumed  Poisson  with  rates  A„,  where  n  is  the 
customer  class.  In  order  for  a  job  to  be  served,  it  must  have  one  of  6  tokens.  Jobs  with 
tokens  queue  up  to  be  served  by  a  single  server.  At  the  completion  of  service,  the  jobs  leave 
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the  system,  and  the  tokens  are  returned  to  the  token  pool  for  reuse.  When  different  classes 
of  jobs  are  competing  for  tokens,  the  class  with  the  highest  priority  gets  one  first.  Jobs  in 
the  same  class  compete  for  tokens  on  a  first  come,  first  served  (FCFS)  basis.  The  arrival 
queues  have  capacity  Cn,  and  the  server  queue  has  capacity  Cs  =  8.  Jobs  in  the  server  queue 
may  be  served  FIFO  (with  no  distinction  made  between  jobs  from  different  classes),  or  they 
may  be  served  according  to  priority  (with  the  highest  priority  jobs  moving  to  the  front  of 
the  queue).  In  any  case,  service  is  nonpremptive  (once  a  job  begins  service,  service  cannot 
be  interrupted,  and  will  continue  until  completion),  and  the  service  time  is  an  exponential 
random  variable  with  parameter  [in. 


Figure  14:  System  used  for  analysis. 


4.2  Modeling  and  Simulation  Analysis  of  ARMS 

In  order  to  study  the  ARMS,  i.e.,  understand  its  basic  dynamic  behavior  and  develop  a 
simulation  model  for  it,  a  detailed  discrete  event  system  (DES)  description  was  developed 
in  [5].  We  review  it  here  so  as  establish  the  basic  notation  and  terminology  required.  We 
begin  by  defining  the  state  of  the  ARMS  model  as  follows, 
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Qn  e  0, 1, . . . ,  Cn  -  number  of  class  n  jobs  in  the  n-th  arrival  queue 

Qq  e  0,1,..., 9  -  number  of  tokens  in  the  token  pool 

Qs  €  0, 1, . .  • ,  6  -  number  of  jobs  in  the  server  queue  (recall  Cs  =  9) 

where  1,2,3.  State  changes  in  the  system  are  caused  by  four  types  of  events, 

An  -  arrival  of  a  class  n  job  to  the  n-th  arrival  queue 
Aq  -  arrival  of  a  token  to  the  token  pool 
As  -  arrival  of  a  job-f token  pair  to  the  server  queue 
Ds  -  departure  of  a  job+token  pair  from  the  server 

Job  arrival  events  are  always  feasible.  Token  arrival  events  are  only  feasible  when  there  are 
jobs  being  serviced  (i.e.,  when  there  are  jobs  in  possession  of  a  token).  Server  arrival  events 
are  only  feasible  when  tokens  are  available  in  the  token  pool.  Departure  events  from  the 
server  are  only  feasible  when  there  a  job  is  being  serviced. 

The  state  transition  mechanism  describing  how  the  state  of  the  system  changes  in  re¬ 
sponse  to  the  various  events  is  given  below. 

Job  Arrival  -  An 

(1)  Qn  >  0,  Qn  <  Cn  — »  Qn  =  Qn  +  1 

(2)  Qn  =  Cn  Qn  =  Qn,  record  blocking  of  a  class  n  job 

(3)  Qn  —  0,  Qq  —  0  — ) ¥  Qn  =  1 

(4)  Qn  =  0,Qq>0  Qq  =  Qq-l,  schedule  As  to  occur  immediately 

(5)  Qn  =  0,Qq>0,Qs  =  0^Qq  =  Qq-  1,  schedule  Ds  to  occur  in  \xn  seconds 

In  equation  (1)  a  class  n  job  arrives  to  find  its  arrival  queue  not  empty,  but  not  at  capacity 
either.  The  job  is  added  to  the  back  of  the  queue,  where  it  must  wait  behind  the  other  jobs 
in  the  queue  before  it  can  compete  for  a  token.  Jobs  in  the  same  class  compete  FCFS  with 
other  jobs  in  the  same  class  for  a  token.  In  equation  (2)  a  class  n  job  arrives  to  find  its  arrival 
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queue  at  capacity.  Since  the  arrival  queue  is  full,  the  job  is  blocked,  and  not  allowed  to  enter 
the  system.  Blocking  is  usually  undesirable,  and  should  generally  be  avoided.  In  equation 

(3)  a  class  n  job  arrives  to  an  empty  arrival  queue,  but  finds  that  no  tokens  are  available. 
The  job,  therefore,  must  wait  in  its  arrival  queue  for  a  token  to  become  available.  In  equation 

(4)  a  class  n  job  arrives  to  an  empty  arrival  queue,  and  finds  a  token  available  (this  implies 
that  all  other  arrival  queues  must  be  empty).  The  job  takes  the  token  from  the  token  pool, 
and  proceeds  to  the  server  queue.  This  triggers  an  As  event  to  occur  immediately.  Finally 
in  equation  (5)  a  class  n  job  arrives  to  find  an  empty  arrival  queue,  an  available  token,  and 
an  empty  server  queue.  This  job  takes  a  token  from  the  token  pool,  and  immediately  begins 
service.  The  service  time  for  the  job  is  /in,  and  a  service  completion  event  Ds  is  scheduled 
to  take  place  in  fin  seconds. 

Service  Completion  -  Ds 

(1)  Qs  =  0  — >  record  performance  for  this  job,  schedule  Ag 

(2)  Qs  >  0  — »•  record  performance  for  this  job,  schedule  Ag,  Qs  =  Qs  —  1,  schedule  Ds 

In  equation  (1)  a  job  has  just  completed  service  and  no  other  jobs  are  waiting  in  the  server 
queue  for  service.  The  system  time  (down  time)  for  the  job  is  recorded,  the  job  leaves  the 
system,  and  the  token  is  returned  to  the  token  pool  (by  scheduling  an  Ag  event  to  occur 
immediately).  In  equation  (2)  a  job  has  just  completed  service,  and  other  jobs  are  waiting 
for  service  in  the  server  queue.  As  before,  the  system  time  for  the  job  that  just  completed 
service  is  recorded,  the  job  leaves  the  system,  and  the  token  is  returned  to  the  token  pool. 
In  addition,  the  server  queue  is  decremented,  and  the  job  at  the  front  of  the  server  queue 
begins  service.  The  time  to  service  this  job  will  be  fj,n,  and  a  service  completion  event  Ds  is 
scheduled  to  take  place  in  /j,n  seconds.  Note,  when  the  server  queue  is  a  priority  queue,  the 
high  priority  jobs  are  shuffled  to  the  front  of  the  queue.  Otherwise,  jobs  are  served  in  the 
order  they  arrived  to  the  server  queue,  independent  of  their  priority. 
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Server  Queue  Arrival  -  As 

(1)  Qs  =  0  — >  schedule  Ds 

(2)  Qs  >  0  — >  Qs  =  Qs  +  1 

In  equation  (1)  a  job+token  pair  arrives  to  the  server  queue.  Since  the  server  queue  is 
empty,  the  job  immediately  begins  service.  The  time  to  service  the  job  is  /in,  and  a  service 
completion  event  Ds  is  scheduled  to  take  place  in  fin  seconds.  In  equation  (2)  a  job+token 
pair  arrives  to  the  server  queue.  In  this  case,  the  server  queue  is  not  empty,  so  the  job  is 
added  to  the  queue.  If  the  server  queue  is  a  priority  queue,  the  job  is  placed  behind  the  last 
job  in  its  class.  Otherwise,  the  job  is  placed  at  the  back  of  the  queue,  independent  of  its 
priority. 

Token  Queue  Arrival  -  Ae 

(1)  Qo  >  0  — >  Qe  —  Qe  +  1 

(2)  Qe  =  0,  Qn  >  0  -»  Qn  =  Qn  ~  1  (highest  priority),  schedule  As 

(3)  Qg  =  0,  Qn  =  0  -»  Q$  =  1 

In  equation  (1)  a  token  is  returned  to  the  token  pool.  Since  the  token  pool  is  not  empty,  the 
token  is  added  to  the  pool.  Note,  the  token  pool  will  contain  tokens  only  when  all  arrival 
queues  are  empty.  In  equation  (2)  a  token  returning  to  the  token  pool  finds  that  the  token 
pool  is  empty,  and  that  there  is  a  job  in  at  least  one  of  the  arrival  queues.  In  this  case,  the 
job  at  the  front  of  the  highest  priority  arrival  queue  takes  the  token,  leaves  its  arrival  queue, 
and  proceeds  to  the  server  queue.  In  equation  (3)  a  token  returning  to  the  token  pool  finds 
that  the  token  pool  is  empty,  and  that  there  are  no  jobs  waiting  in  any  of  the  arrival  queues. 
In  this  case,  the  token  remains  in  the  token  pool. 

Based  on  the  DES  description  above,  a  simulator  was  developed,  and  data  were  collected 
to  assess  the  effects  that  the  various  parameters  have  on  the  system  performance.  The 
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Figure  15:  Effect  of  changing  the  class  2  arrival  rate  when  the  server  queue  has  priorities. 

performance  measure  we  use  is  the  mean  system  time  for  each  job  class.  The  system  time  is 
the  interval  from  when  a  job  arrives  to  the  system  to  the  time  the  job  completes  service  and 
leaves  the  system.  In  the  ARMS  problem,  the  service  time  is  the  “down  time”  of  an  aircraft. 
For  the  experiments  that  follow,  the  arrival  queues  are  assumed  to  have  infinite  capacities, 
i.e.,  Cn  —  oo  for  n  =  1,2,3. 

4.2.1  Arrival  Rate  Analysis 

Some  preliminary  results  for  this  analysis  were  reported  in  [5]  and  are  briefly  reviewed  here 
before  extending  them. 

Figure  15  shows  what  happens  when  the  server  queue  is  a  priority  queue,  which  allows 
the  higher  priority  jobs  to  jump  to  the  front.  In  this  case,  the  class  3  jobs  really  suffer.  Not 
only  do  the  class  3  jobs  get  starved  for  tokens,  but  even  when  they  do  get  a  token,  they 
keep  getting  pushed  to  the  rear  of  the  server  queue.  The  class  1  jobs,  on  the  other  hand,  are 
unaffected  by  the  arrival  rate  of  the  class  2  jobs. 

Figure  16  shows  the  effect  of  changing  the  arrival  rate  of  the  class  3  jobs  when  the  server 
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Figure  16:  Effect  of  changing  the  class  3  arrival  rate  when  the  server  queue  is  FIFO. 

queue  is  FIFO.  As  can  be  seen,  the  effects  of  changing  the  class  3  arrival  rate  are  very  similar 
to  the  effects  of  changing  the  class  2  arrival  rate.  As  the  class  3  arrival  rate  increases,  the 
system  times  of  the  other  two  classes  increase,  because  they  have  to  wait  longer  in  the  server 
queue.  But,  again,  no  matter  how  high  the  arrival  rate  of  the  class  3  jobs  gets,  the  system 
times  of  the  class  1  and  2  jobs  will  never  go  much  higher  than  about  21  seconds  (1  second 
waiting  for  a  token,  19  seconds  waiting  in  the  server  queue,  and  1  second  being  served). 

Figure  17  shows  what  happens  when  the  server  queue  is  a  priority  queue.  In  this  case, 
the  class  1  and  2  jobs  are  unaffected  by  the  class  3  arrival  rate. 

To  summarize,  we  have  seen  that  changing  the  arrival  rate  of  the  class  1  jobs  has  a 
significant  effect  on  the  system  times  of  the  other  classes.  When  the  server  queue  is  FIFO, 
with  no  distinction  between  classes,  we  saw  that  increasing  the  arrival  rate  of  the  lower 
priority  jobs  effects  the  system  times  of  the  higher  priority  jobs  by  forcing  them  to  wait 
longer  in  the  server  queue.  We  did  notice,  however,  that,  regardless  how  high  the  arrival 
rate  of  the  lower  priority  jobs  becomes,  the  system  times  of  the  high  priority  jobs  never 
exceeds  a  certain  ceiling  that  depends  on  the  number  of  tokens  in  the  system.  When  the 
server  queue  is  has  priorities,  and  allows  high  priority  jobs  to  jump  to  the  front  of  the  queue, 
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Figure  17:  Effect  of  changing  the  class  3  arrival  rate  when  the  server  queue  has  priorities. 

the  behavior  is  different.  In  this  case,  the  arrival  rates  of  the  lower  priority  jobs  have  little 
effect  on  the  system  times  of  the  high  priority  jobs.  The  low  priority  jobs,  however,  are 
severely  effected  by  increases  in  the  arrival  rates  of  high  priority  jobs.  We  also  noticed  a 
characteristic  which  is  typical  of  queueing  systems:  As  the  arrival  rate  increases,  the  service 
times  asymptotically  approach  infinity.  As  a  final  remark,  we  note  that,  as  the  arrival  rates 
becomes  high,  very  long  simulation  runs  are  required  to  collect  statistically  reliable  data. 
This  is  a  common  feature  of  queueing  systems:  When  the  system  is  running  at  or  near 
its  stability  limits  (the  point  where  jobs  begin  to  arrive  faster  than  the  server  can  possibly 
process  them),  long  simulation  runs  are  needed  to  collect  statistically  reliable  performance 
data. 


4.2.2  Service  Time  Analysis. 

Here  we  see  how  changing  the  service  time  of  the  class  n  jobs  affects  the  system  times  of 
the  other  classes.  As  before,  we  set  the  number  of  tokens  to  0  =  20,  so  that  the  token  loop 
can  be  neglected.  We  set  the  arrival  rates  for  each  class  to  be  equal  at  Ai  =  A2  =  A3  =  0.2 
jobs/second.  Then  we  varied  the  service  time  for  one  class  at  a  time.  The  service  times  for 
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Figure  18:  Effect  of  changing  the  class  1  service  time  when  the  server  queue  is  FIFO. 

the  classes  that  were  not  being  varied  were  set  to  fi  =  1  second.  For  the  experiments  in  this 
section,  the  server  queue  is  FIFO,  with  no  distinction  between  job  classes. 

Figure  18  shows  what  happens  when  we  change  the  service  time  of  the  class  1  jobs.  As 
we  see,  the  effect  is  very  similar  to  increasing  the  arrival  rate  of  the  class  1  jobs.  That  is, 
as  the  service  time  of  the  class  1  jobs  increases,  the  other  two  classes  become  starved  for 
tokens,  and  their  system  times  begin  to  increase.  The  mechanism  by  which  this  happens  is 
as  follows:  As  the  service  time  increases,  the  probability  that  a  class  1  job  will  be  waiting 
when  a  token  becomes  available  increases.  The  class  1  jobs  take  the  tokens,  and  the  other 
two  classes  are  forced  to  wait.  As  the  system  time  of  the  class  1  jobs  becomes  increasingly 
larger,  the  other  job  classes  never  get  a  token,  and  their  arrival  queues  become  unstable. 
Eventually,  the  arrival  rate  of  the  class  1  jobs  exceeds  the  rate  at  which  the  system  can 
process  them  (because  of  the  long  service  time),  and  the  class  1  arrival  queue  will  also  go 
unstable.  As  we  saw  before,  the  onset  of  instability  is  marked  by  an  asymptotic  rise  in  the 
system  time. 

As  seen  in  Figure  19  and  Figure  20,  the  effects  of  changing  the  service  times  of  the  class 
2  and  class  3  jobs  has  very  much  the  same  effect  as  changing  their  arrival  rates. 
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Class  2  Service  Tim* 

Figure  19:  Effect  of  changing  the  class  2  service  time  when  the  server  queue  is  FIFO. 


Class  3  Service  Time 

Figure  20:  Effect  of  changing  the  class  3  service  time  when  the  server  queue  is  FIFO. 
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To  summarize,  changing  the  service  time  has  an  effect  that  is  very  similar  to  changing 
the  arrival  rate.  As  the  service  time  of  the  class  1  jobs  increases,  these  jobs  utilize  most  of 
the  tokens,  and  the  other  two  classes  suffer.  As  the  service  time  of  the  class  2  jobs  increases, 
they  make  use  of  any  tokens  not  used  by  the  class  1  jobs,  and  the  class  3  jobs  suffer.  As  the 
service  time  of  the  class  3  jobs  increases,  the  other  two  classes  must  wait  behind  them  in  the 
server  queue.  As  before,  if  the  server  queue  were  a  priority  queue  and  not  a  FIFO  queue, 
the  lower  priority  classes  would  have  much  less  effect  on  the  higher  priority  classes. 

4.2.3  Token  Influence. 

Here  we  look  at  the  effect  of  changing  the  number  of  tokens.  To  do  so,  we  fix  the  arrival 
rates  of  the  three  job  classes  to  Ai  =  A2  =  A3  =  0.30  jobs/second  and  their  service  times  to 
=  1  second/job.  Then  we  varied  the  number  of  tokens. 

Figure  21  shows  what  happens  as  the  number  of  tokens  is  decreased  from  0  =  20  down  to 
0  =  1.  In  this  figure,  the  server  queue  is  FIFO,  with  no  distinction  between  classes.  What  is 
interesting  is  that  the  system  times  of  the  class  1  and  class  2  jobs  actually  goes  down  as  the 
number  of  tokens  is  decreased.  This  happens  because,  as  the  number  of  tokens  is  decreased, 
the  time  that  a  job  must  wait  in  the  server  queue  is  decreased.  Consequently,  the  overall 
system  time  is  decreased.  The  number  of  tokens  indicates  the  maximum  number  of  jobs  that 
can  be  present  in  the  server  queue.  When  jobs  in  the  server  queue  are  served  FIFO,  more 
tokens  means  longer  waiting  in  the  server  queue.  For  the  class  3  jobs,  however,  their  system 
times  increase  as  the  number  of  tokens  is  decreased.  This  is  because  the  probability  that  a 
class  1  or  class  2  job  will  be  waiting  when  a  token  becomes  available  increases  as  the  number 
of  tokens  decreases,  and  the  class  3  jobs  become  starved  for  tokens. 

Figure  22  shows  what  happens  when  the  server  queue  has  priorities.  As  can  be  seen,  the 
system  times,  in  this  case,  are  unaffected  by  the  number  of  tokens.  When  there  are  many 
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Figure  21:  Effect  of  changing  the  number  of  tokens  when  the  server  queue  is  FIFO. 
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Figure  22:  Effect  of  changing  the  number  of  tokens  when  the  server  queue  has  priorities. 
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Input  Variable 

Lower  Limit 

Upper  Limit 

Ai  (customers/sec) 

0.1 

1.0 

a2 

0.1 

1.0 

A3 

0.1 

1.0 

e 

1 

20 

Table  1:  Range  of  interest  for  each  input  variable 


tokens,  the  jobs  wait  in  the  server  queue.  When  there  are  few  tokens,  the  jobs  wait  in  their 
respective  arrival  queues.  The  amount  of  time  they  wait,  however,  does  not  change.  All  that 
changes  is  where  they  wait. 

4.3  Metamodeling  of  ARMS  using  a  Cascade  Correlation  Neural 
Network 

For  the  purposes  of  metamodeling,  we  concentrated  on  the  relationship  between  four  inputs: 
Ai,  A2,  A3  (arrival  rates  for  the  three  customer  classes),  and  0  (number  of  tokens);  and  the 
output,  J  =  si,  i.e.,  the  service  time  of  the  class  1  (highest  priority)  jobs.  The  service  time 
is  considered  to  be  the  time  from  the  job  arrival  at  the  system  to  the  time  the  job  finishes 
service  and  departs  the  system.  In  the  context  of  aircraft  refueling  and  maintenance,  this  is 
the  “downtime”  or  amount  of  time  that  an  aircraft  is  not  available  for  service.  The  range  of 
inputs  are  given  in  Table  1.. 

This  four-input  single-output  case  was  also  studied  using  polynomial  metamodels  based 
on  the  techniques  reported  in  [8,  27,  26]  and  also  used  in  our  earlier  work  [5]. 

Some  CCNN  results  on  ARMS  metamodeling  were  partly  reported  in  [6]  (included  in  the 
Appendix).  In  this  section,  we  have  assembled  all  our  numerical  results,  including  some  new 
3-D  plots  for  the  ARMS  model. 
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We  have  three  sets  of  data: 


•  Training  set  for  CCNN,  size=-500.  uniformly  randomly  generated  in  the  input  range 

•  Test  set,  size=500,  uniformly  randomly  generated  in  the  input  range 

•  Training  set  for  4th-order  polynomial,  size=49,  generated  by  layered  CCD  design  [5], [6]. 

The  4th  order  polynomial  metamodel  we  used  is  given  in  [6]  and  a  square  root  transfor¬ 
mation  to  the  output  is  used.  For  CCNN  training,  we  used  two  schemes: 

•  CCNN(I):  The  output  is  simply  rescaled  version  (here  the  factor  is  0.1)  of  the  real 
output 

•  CCNN(II):  Natural  log  transformation  to  the  real  output  is  used. 

When  the  arrival  rate  approaches  the  service  rate,  the  output  rapidly  increases,  so  it  is 
reasonable  to  make  use  of  a  log  transformation  which  has  the  property  of  “flattening”  the 
output  curve  and  making  the  learning  task  easier. 

Results  are  shown  in  Table  2.  The  CCNN(I)  in  the  table  has  25  hidden  units;  the 
CCNN(II)  has  23  hidden  units.  Although  the  training  of  the  CCNN  is  done  according  to 
the  mean  squared  error  (MSE),  we  also  look  at  the  mean  squared  relative  error  (MSRE) 
which  provides  a  better  picture  of  the  actual  metamodel  performance.  ^From  our  earlier 
work,  we  know  that  the  ARMS  output  can  explode  when  the  parameters  approach  a  certain 
region.  The  MSRE  is  therefore  more  appropriate  than  the  MSE  in  showing  how  a  metamodel 
performs  on  the  whole  range.  The  MSRE  is  defined  as 

11  i=  1 
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training  set  (size:49) 

training  set  (size:500) 

Polynomial 

1.874  (11.0%) 

22.76  (35.0%) 

MMHEaaa™ 

CCNN(I) 

85.81  (64.2%) 

3.438  (13.5%) 

13.01  (22.9%) 

CCNN(II) 

86.69  (27.0%) 

6.890  (12.3%) 

10.57  (16.3%) 

Table  2:  CCNN  and  polynomial  metamodeling  results  of  ARMS 

where  e,-  is  the  error  and  m  is  the  output.  The  numbers  in  the  Table  2  show  the  MSE  and, 
in  parenthesis,  the  square  root  of  the  MSRE  expressed  as  a  percentage  value. 

It  can  be  seen  from  Table  2  that  the  4th  order  polynomial  performs  well  on  the  49-point 
training  set.  However,  the  MSRE  values  are  quite  high  for  the  other  two  500-point  sets. 
This  is  attributed  to  the  fact  that  most  points  in  the  49-point  data  set  are  for  extreme  cases 
where  certain  A values  are  too  large  or  the  number  of  tokens  is  too  small. 

Note  that  CCNN(I)  achieves  good  training  results  without  too  many  hidden  units.  How¬ 
ever,  it  cannot  be  well  generalized  to  the  test  data  set  and  performs  poorly  for  the  49-point 
set  due  to  the  choices  of  extreme  values. 

The  CCNN(II)  results  show  improvement  for  all  data  sets,  especially  when  we  look  at 
the  MSRE  criterion.  This  is  due  to  taking  the  log  of  the  output,  which  makes  the  surface 
“flatter”  as  already  mentioned. 

Next,  we  present  some  3-D  plots  for  the  ARMS  model  in  Figures  23  through  31. 

Figures  23  through  25  show  the  case  where  we  fix  A2  =  0.25,  A3  =  0.3  and  vary  Xi  and 
6.  This  corresponds  to  a  case  where  class  2  and  class  3  jobs  have  relatively  comparable 
and  moderate  traffic.  We  can  observe  the  service  time  of  class  1  for  different  class  1  arrival 
rates  and  different  amounts  of  tokens.  From  Figure  23,  we  see  that  Si  increases  in  both 
dimensions  (Ai  and  6)  and  goes  up  drastically  on  the  edge  (0.8  <  Ai  <  1).  Figure  24  shows 
the  corresponding  CCNN(I)  metamodel  yielding  MSE  =  10.18  and  y/MSRE  29.1%. 
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Figure  25  is  the  CCNN(II)  result,  with  MSE  =  10.58  and  y/MSRE  =  19.1%. 


Figures  26  through  28  are  for  the  case  where  A2  =  0.2,  A3  =  0.4.  The  shape  is  not  much 
different  from  that  in  the  previous  setting.  The  performance  for  CCNN(I)  and  CCNN(II)  are, 
MSE  =  9.72,  y/MSRE  =  25.0%  and  MSE  =  8.15  and  \/MSRE  =  18.1%,  respectively. 

Figures  29  through  31  show  the  case  when  we  fix  Xx  =  0.3, 6  =  10,  that  is,  we  observe 
the  influence  of  lower  priority  classes  on  class  1  jobs  for  a  particular  class  1  arrival  rate 
and  amount  of  tokens  in  the  system.  From  Figure  29  we  see  a  surface  with  much  smaller 
variation  in  the  output.  The  surface  is  mostly  flat  for  large  values  of  A2  and  A3,  but  has 
a  rather  sharp  slope  at  the  corner  for  small  A2,  A3.  Figures  30  and  31  show  the  results  of 
CCNN(I)  and  CCNN(II),  with  MSE  =  1.153,  \JMSRE  =  26.9%  and  MSE  =  0.776  and 
y/MSRE  =  10.7%,  respectively.  It  is  interesting  to  note  that  CCNN(I)  seems  to  learn  the 
flat  shape  better  than  CCNN(II)  although  the  overall  performance  of  the  former  is  not  as 
good  as  that  of  the  latter. 

Observe  that  the  3-D  plots  for  the  simulated  data  show  some  very  rough  surfaces,  espe¬ 
cially  for  the  first  two  cases  where  the  output  increases  significantly.  The  spikes  at  the  edge 
are  the  result  of  insufficient  sample  sizes  (the  stopping  condition  for  all  our  simulations  is 
defined  by  the  number  of  jobs  served  exceeding  1000).  Recall  that  when  Xi  increases,  the 
queueing  system  will  become  unstable. 

Finally,  we  performed  another  series  of  experiments  where  the  sizes  of  training  sets  for 
both  polynomial  and  CCNN  metamodels  are  increased.  Results  are  given  by  Table  3.  The 
polynomial  is  obtained  using  a  500-point  training  set;  the  CCNN(II)  with  23  hidden  units  is 
trained  using  a  1000-point  data  set.  The  results  suggest  that  additional  training  points  for 
the  polynomial  metamodel  do  not  lead  to  better  performance,  and  the  same  is  true  for  the 
CCNN  metamodel. 
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Figure  23:  Simulation:  Output  £1  vs.  (Ai,0),  A2  =  0.25,  A3 


o  o 


6:  number  of  tokens 


Figure  24:  CCNN(I):  Output  Si  vs.  (Al50),  A2  =  0.25,  A3 
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Figure  26:  Simulation:  Output  Si  vs.  (Ai,0),  A2  =  0.2,  A3  =  0.4 
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Figure  27:  CCNN(I):  Output  Si  vs.  (Ai,0),  A2  =  0.2,  A3  =  0.4 
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49  point  set 


Polynomial 


49  point  set 


CCNN(II)  80.26  (23.2%) 


7.929  (12.7%) 


11.35  (13.7%) 


Table  3:  CCNN  and  polynomial  trained  using  larger  data  set 
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5  CONCLUSIONS  AND  FUTURE  RESEARCH  DI¬ 
RECTIONS 

In  this  section,  we  summarize  the  main  findings,  lessons  learned,  and  recommendations  for 
future  directions  that  have  resulted  from  this  project. 

Concurrent  and  Parallel  Simulation.  A  basic  theoretical  framework  underlying 
concurrent  simulation  algorithms  is  one  of  the  accomplishments  of  this  project,  capitalizing 
and  extending  the  results  of  our  earlier  work.  Based  on  this  framework,  we  have  developed  a 
general  (i.e.,  not  limited  by  specific  modeling  assumptions)  concurrent  simulation  approach 
implemented  through  the  Time  Warping  Algorithm  (TWA)  and  quantified  and  compared 
the  effectiveness  of  this  algorithm  to  conventional  repetitive  simulation  through  the  concept 
of  “speedup”.  We  have  found  that  the  speedups  achieved  by  concurrent  simulation  are 
significantly  affected  by  such  factors  as  the  particular  computer  hardware  used  and  the  data 
structures  selected.  This  suggests  the  need  for  a  further  exploration  of  different  ways  to 
implement  the  TWA  and  its  variants. 

Application  of  concurrent  simulation  techniques  to  parallel  processing  environments  promises 
at  least  one  additional  order  of  magnitude  in  speedup  over  concurrent  simulation  on  se¬ 
quential  processors.  The  integration  of  computer  simulation  methodologies  with  parallel 
processing  architectures  is  a  direction  that  holds  great  promise. 

Stochastic  Fidelity  and  Multi-resolution  Simulation.  One  of  the  accomplishments 
of  this  project  is  to  document  and  substantiate  the  stochastic  fidelity  issue  that  arises  in  hi¬ 
erarchical  simulation.  Simple  averaging  of  the  output  data  from  a  high  resolution  simulator 
to  generate  input  data  for  a  low  resolution  simulator  is  inadequate  and  occasionally  dramat¬ 
ically  erroneous.  Resolving  this  issue  is  associated  with  the  problem  of  systematic  clustering 
methods  that  “bundle”  high  resolution  output  data  in  a  way  that  incorporates  the  statistical 
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information  lost  through  simple  averaging.  Our  research  over  various  clustering  methods  has 
led  us  to  the  path  bundle  grouping  approach  based  on  Adaptive  Resonance  Theory  presented 
in  Section  3.2.4.  A  crucial  research  direction  to  be  pursued  in  the  future  is  the  development 
of  fast  clustering  algorithms  to  group  the  huge  dimensional  data  generated  from  typical  high 
resolution  simulators.  Our  investigation  of  the  ART2  neural  network  for  this  purpose  has 
led  to  promising  results  and  identified  key  issues  to  be  addressed. 

The  nature  of  this  line  of  research  is  such  that  significant  progress  is  made  through  ex¬ 
tensive  empirical  work  and  experimentation  with  benchmark  problems.  We  therefore  believe 
that  such  problems  need  to  be  defined  and  thoroughly  explored  at  the  same  time  as  specific 
analytical  or  numerical  tools  are  developed. 

Metamodeling  through  Neural  Networks.  The  metamodeling  procedure  we  have 
studied  combines  simulation  of  a  complex  system  with  the  process  of  training  a  neural 
network  to  become  a  surrogate  model  of  this  system.  This  exploits  the  ability  of  a  neural 
network  to  act  as  a  universal  function  approximator.  Using  a  Cascade  Correlation  Neural 
Network  (CCNN),  a  multilayer  neural  network  that  builds  itself  while  it  learns,  we  were  able 
to  study  concrete  models  and  problems  demonstrating  that  neural  network  metamodels  are 
significantly  more  accurate  than  their  polynomial  counterparts,  especially  when  asymptotic 
behavior  in  input-output  relationships  is  involved;  such  is  the  case  in  the  ARMS  case,  as 
detailed  in  Sections  4.2  and  4.3. 

Important  issues  we  have  identified  in  this  component  of  our  project  include:  (a)  De¬ 
termining  the  training  data  size  for  metamodeling  through  neural  networks.  An  interesting 
direction,  for  example,  is  the  investigation  of  adaptive  mechanisms  and  segmentation  of  a 
problem  to  develop  separate  models  for  different  regions  of  the  input  space,  (b)  Improving 
the  learning  efficiency  and  speeding  up  the  learning  process  of  a  neural  network.  As  an 
example,  starting  from  several  different  initial  points  (multi-starts)  seems  to  substantially 
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help  training,  but  has  yet  to  be  studied  in  a  systematic  framework. 
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APPENDIX 


This  Appendix  contains  reprints  of  publications  that  are  relevant  and  provide  supporting 
documentation  for  this  report. 
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Abstract.  The  sample  path  constructability  problem  for  Discrete  Event  Systems  (DES)  involves  the  observation 
of  a  sample  path  under  a  particular  parameter  value  6  of  the  system  with  the  requirement  to  concurrently  construct 
multiple  sample  paths  of  the  DES  under  different  values  using  only  information  available  along  the  given  sample 
path.  This  allows  the  on-line  estimation  of  performance  measures  J(6),  not  available  in  closed  form,  over  a 
range  of  values  of  6.  We  present  a  sample  path  coupling  approach  that  solves  the  problem  without  imposing 
any  restrictions  on  the  event  processes  in  the  system.  A  specific  “time  warping"  algorithm  is  described  and  its 
performance  is  analyzed  in  terms  of  computational  cost.  Our  approach  is  illustrated  through  a  number  of  simulation 
results. 

Keywords:  discrete  event  systems,  sample  path  construction,  concurrent  estimation,  timed  state  automaton 


1.  Introduction 

A  typical  problem  one  faces  in  the  design,  control,  and  optimization  of  Discrete  Event 
Systems  (DES)  is  that  of  determining  how  some  performance  measure  J{6)  varies  as 
a  function  of  some  parameter  6.  As  a  rule,  analytical  expressions  for  J{6)  are  simply 
unavailable,  thus  forcing  one  to  resort  to  repetitive  simulation  or  on-line  trial-and-error 
techniques.  This  requires  the  generation  of  a  sample  path  of  the  system  at  least  once  for 
every  value  of  6  of  interest.  It  is  easy  to  see  how  tedious  (if  at  all  feasible)  this  process 
becomes,  especially  when  6  is  a  vector  and  the  DES  is  stochastic,  requiring  large  number 
of  sample  paths  to  achieve  desired  levels  of  statistical  accuracy. 

It  is  by  now  well-documented  in  the  literature  that  the  nature  of  sample  paths  of  DES  can 
be  exploited  so  as  to  extract  a  significant  amount  of  information,  beyond  merely  an  estimate 
of  J (6).  It  has  been  shown  that  observing  a  sample  path  under  some  parameter  value  6 
allows  us  to  efficiently  obtain  estimates  of  derivatives  of  the  form  dJ/d6  which  are  in  many 
cases  unbiased  and  strongly  consistent  (e.g.,  see  Cassandras,  1993;  Glasserman,  1991; 
Ho  and  Cao,  1991)  where  Infinitesimal  Perturbation  Analysis  (IPA)  and  its  extensions  are 
described).  Similarly,  Finite  Perturbation  Analysis  (FPA)  has  been  used  to  estimate  finite 
differences  of  the  form  AJ(A6)  or  to  approximate  the  derivative  dJ/d9  through  AJ/A6 
when  other  PA  techniques  fail.  Of  particular  interest  are  often  parameters  6  that  take  values 
from  a  discrete  set  [di .....  9m }  (e.g.,  queueing  capacities,  threshold  values  in  certain  control 
policies),  in  which  case  we  desire  to  effectively  construct  sample  paths  under  any  0i , . . . ,  9m 
by  just  observing  a  sample  path  under  one  of  these  parameter  values.  All  of  the  methods 
developed  to  date,  regardless  of  specific  details,  have  been  motivated  by  the  same  objective: 
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From  a  single  sample  path  under  0  extract  information  to  estimate  the  derivative  d  J  jdO  or 
the  response  of  the  system,  J  {O'),  under  other  parameter  values  O'  9.  This  information 
can  be  extremely  useful  in  sensitivity  analysis  and  optimization  of  DES.  It  is  also  the  case 
that  structural  properties  of  a  DES,  such  as  monotonicity  and  convexity,  are  revealed  by 
this  type  of  sample  path  analysis  (see  also  Glasserman  and  Yao,  1994). 

In  this  paper,  we  will  concentrate  on  the  general  sample  path  constructability  problem  for 
DES.  That  is,  given  a  sample  path  under  a  particular  parameter  value  0,  the  problem  is  to 
construct  multiple  sample  paths  of  the  system  under  different  values  using  only  information 
available  along  the  given  sample  path.  A  solution  to  this  problem  can  be  obtained  when 
the  system  under  consideration  satisfies  the  Constructability  Condition  ( CO)  presented  in 
Cassandras  and  Strickland  (1989a, b).  Suppose  that  a  sample  path  of  the  system  is  observed 
under  parameter  0  and  we  would  like  to  construct  the  corresponding  sample  path  under 
some  O'.  Then  (CO)  consists  of  two  parts.  The  first  part  is  the  Observabilty  Condition 
(OB)  which  states  that  at  every  state  the  feasible  event  set  of  the  constructed  sample  path 
must  be  a  subset  of  the  feasible  event  set  of  the  observed  sample  path.  The  second  part  is 
a  requirement  that  all  lifetimes  of  feasible  events  conditioned  on  event  ages  are  equal  in 
distribution. 

Unfortunately,  (CO)  is  not  easily  satisfied.  Nonetheless,  two  methods  to  date  have  been 
developed  making  it  possible  to  construct  multiple  sample  paths  at  different  parameter 
settings  from  a  single  sample  path  at  some  extra  cost.  In  particular,  the  Standard  Clock 
(SC)  approach  (Vakili,  1991)  solves  the  sample  path  constructability  problem  for  models 
with  exponentially  distributed  event  lifetimes  by  exploiting  the  well-known  uniformization 
technique  for  Markov  chains.  This  approach  allows  the  concurrent  construction  of  multiple 
sample  paths  under  different  (continuous  or  discrete)  parameters  at  the  expense  of  introduc¬ 
ing  “fictitious”  events.  Chen  and  Ho  (1995)  have  proposed  a  generalized  SC  approach  that 
uses  approximation  techniques  to  extend  the  SC  approach  to  systems  with  non-exponential 
event  lifetime  distributions.  On  the  other  hand,  Augmented  System  Analysis  (ASA)  (Cas¬ 
sandras  and  Strickland,  1989a,b)  solves  the  constructability  problem  by  “suspending”  the 
construction  of  one  or  more  paths  during  certain  segments  of  the  observed  sample  path  in 
a  way  such  that  the  stochastic  characteristics  of  the  observed  sample  path  are  preserved.  In 
ASA,  it  is  still  necessary  to  assume  exponential  event  lifetime  distributions,  although,  with  a 
minor  extension  it  is  possible  to  allow  at  most  one  event  to  have  a  non-exponential  lifetime 
distribution  (see  Cassandras,  1993;  Cassandras  and  Strickland,  1989b  for  details). 

In  this  paper  we  develop  and  analyze  an  approach  for  solving  the  constructability  problem 
for  general  DES  (Cassandras  and  Panayiotou,  1996).  We  emphasize  that  the  proposed 
scheme  is  suited  to  on-line  sample  path  construction,  where  actual  system  data  are  processed 
for  performance  estimation  purposes.  In  contrast,  the  SC  approach  is  only  relevant  in  the 
context  of  simulation-based  analysis  (although  it  is  possible  to  adapt  the  SC  to  on-line 
applications  (Cassandras  et  al„  1990),  but  with  considerable  effort).  Moreover,  unlike 
SC  and  ASA,  our  approach  can  be  used  in  systems  with  arbitrary  lifetime  distributions 
without  involving  any  of  the  approximations  suggested  in  the  generalized  SC  method  in 
(Chen  and  Ho,  1995).  The  central  idea  of  our  approach  is  quite  simple:  when  an  event 
on  the  observed  sample  path  under  0  occurs  and  causes  a  state  transition,  its  lifetime  is 
stored.  These  stored  lifetimes  are  used  as  the  input  to  multiple  concurrent  sample  path 
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generators  (e.g.,  simulators)  under  different  parameter  values  9'  ?  9.  Thus,  instead  of 
separately  generating  lifetimes  for  each  event  which  is  common  in  the  observed  and  each 
constructed  sample  path  we  use  the  lifetimes  that  have  been  directly  observed.  Whenever  a 
constructed  sample  path  enters  a  state  such  that  the  feasible  event  set  includes  events  which 
have  not  yet  occurred  on  the  observed  sample  path  (i.e.,  no  lifetimes  have  been  observed 
for  these  events),  the  construction  is  suspended  until  the  required  events  and  corresponding 
lifetimes  become  available  on  the  nominal  path.  Therefore,  the  constructability  condition 
is  bypassed  at  the  expense  of  a  process  for  starting/stopping  each  sample  path  construction. 
In  this  paper  we  will  define  this  process,  study  its  properties  and  compare  its  effectiveness 
to  other  approaches.  The  viewpoint  we  adopt  in  the  sample  path  constructability  problem 
is  one  of  coupling  the  observed  process  under  9  to  all  processes  under  different  9^9 
and  then  deriving  event-driven  dynamics  that  describe  this  coupling.  Also,  note  that  in 
certain  cases  event  lifetimes  may  not  be  directly  observable,  in  which  case  they  need  to  be 
recovered  from  the  output  available,  a  problem  referred  to  as  invertibility  (Park  and  Chong, 

It  is  worth  pointing  out  that  this  concurrent  sample  path  constructability  approach  can  be 
used  in  two  different  modes.  Primarily,  it  can  be  used  on-line  as  a  concurrent  estimation 
scheme.  In  addition,  for  a  class  of  systems,  it  can  be  used  off-line  as  a  concurrent  simulation 
approach.  In  the  case  of  concurrent  estimation,  the  scheme  observes  a  real  DES  under  some 
parameter  value  9  and  estimates  the  system’s  performance  under  a  set  of  hypothetical  pa¬ 
rameter  values  9\,...,9m-  These  estimates  can  be  used  together  with  optimization  schemes 
requiring  such  information  (e.g.,  Yan  and  Mukai,  1992;  Gong  et  al.,  1992;  Cassandras  and 
Julka,  1994,  Panayiotou  and  Cassandras,  1996);  also,  see  (Panayiotou  and  Cassandras, 
1997)  for  an  application  of  the  scheme  in  a  dynamic  resource  allocation  problem).  In  the 
case  of  concurrent  simulation,  the  algorithm  developed  is  incorporated  into  a  simulation 
environment  in  order  to  generate  performance  estimates  under  a  range  of  parameters  faster 
than  repeatedly  simulating  the  system  under  each  parameter  (what  is  commonly  referred  to 
as  “brute-force  simulation”). 

The  paper  is  organized  as  follows.  In  section  2  we  formally  define  the  general  sample 
path  constructability  problem.  In  section  3  we  describe  the  process  through  which  the 
problem  is  solved  by  coupling  an  observed  sample  path  to  multiple  concurrently  generated 
sample  paths  under  different  parameter  settings,  and  a  detailed  procedure,  the  Time  Warping 
Algorithm  is  presented.  An  evaluation  of  the  proposed  algorithm  is  presented  in  section  4. 
Some  extensions  of  the  algorithm  are  discussed  in  section  5  and  several  simulation  results 
are  included  in  section  6.  Finally  we  close  with  conclusions  from  this  work  in  section  7. 


2.  Problem  Definition 

We  consider  a  DES  and  adopt  the  modeling  framework  of  a  stochastic  timed  state  automaton 
(£,X,r,f,x0)  (Cassandras,  1993).  Here,  £  is  a  countable  event  set,  *  is  a  countable  state 
space,  and  r(jt)  is  a  set  of  feasible  (or  enabled)  events,  defined  for  all  x  e  X  such  that 
F(x)  c  E.  The  state  transition  function  f(x,  e)  is  defined  for  all  x  €  X,  e  e  r(x).  and 
specifies  the  next  state  resulting  when  e  occurs  at  state  x.  Finally,  x0  is  a  given  initial  state. 
In  addition,  for  simplicity  we  assume  that  the  DES  satisfies  the  non-interruption  condition, 
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Figure  1.  The  sample  path  constructability  problem  for  DES. 


i.e.,  once  an  event  is  enabled  it  cannot  be  disabled;  this  is  not  essential  to  the  derivation  of 
our  results  however. 

Remark  The  definition  is  easily  modified  to  ( £,X,r,p,Po )  in  order  to  include  probabilistic 
state  transition  mechanisms.  In  this  case,  the  state  transition  probability  p(x';  x,  e')  is 
defined  for  all  x,  x'  €  X,  e'  e  £ ,  and  is  such  that  p(x';  x ,  e')  =  0  for  all  e'  $  TU).  In 
addition,  p0(x)  is  the  pmf  P[jc0  =x],x  e  X,  of  the  initial  state  x0. 

Assuming  the  cardinality  of  the  event  set  £  is  N,  the  input  to  the  system  is  a  set  of  event 
lifetime  sequences  {V|, . . . ,  Vw),  one  for  each  event,  where  V,  =  (v,(l),  v,(2), . . .}  is 
characterized  by  some  arbitrary  distribution.  Under  some  system  parameter  60,  the  output 
is  a  sequence  f  (0O)  =  [(ek,  tk),  k  =  1,2,...}  where  ek  s  £  is  the  kih  event  and  tk  is 
its  corresponding  occurrence  time  (see  Figure  1).  Based  on  any  observed  f  (0O),  we  can 
evaluate  L[£(0o)],  a  sample  performance  metric  for  the  system.  For  a  large  family  of 
performance  metrics  of  the  form  S(0O)  =  E[Z,[£(0O)]],  L[$(0O)]  is  therefore  an  estimate 

of  J(0U).  Defining  a  set  of  parameter  values  of  interest  [&o,Qi . 6M),  the  sample  path 

constructability  problem  is: 

For  a  DES  under  60,  construct  all  sample  paths  %(9\) . £(0M)  given  a  realization 

of  lifetime  sequences and  the  sample  path  £  (0O). 

For  simplicity,  in  the  rest  of  this  paper  we  assume  that  the  DES  under  investigation  satisfies 
the  following  three  assumptions.  Extensions  allowing  the  relaxation  of  these  assumptions 
are  possible  and  are  briefly  described  in  section  5. 

•  (Al)  Feasibility  Assumption :  Let  x„  be  the  state  of  the  DES  after  the  occurrence  of  the 
nth  event.  Then,  for  any  n ,  there  exists  at  least  one  r  >  n  such  that  e  e  T(xr)  for  any 
e  e  £. 
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•  (A2)  Invariability  Assumption :  Let  £  be  the  event  set  under  the  nominal  parameter  60 

and  let  £m  be  the  event  set  under  9m  #  60.  Then,  £m  =  £■ 

.  (A3)  Similarity  Assumption:  Let  G,  (Bo),  i  €  £  be  the  event  lifetime  distribution  for  the 

event  i  under  60  and  let  Gi(6m ),  i  €  f  be  the  corresponding  event  lifetime  distribution 
under  6m.  Then,  G,(0 0)  =  Gj(0„)  for  all  i  e  £. 

Assumption  A7  guarantees  that  in  the  evolution  of  any  sample  path  all  events  in  £  will 
always  become  feasible  at  some  point  in  the  future.  If  for  some  DES  assumption  A1  is 
not  satisfied,  i.e.  there  exists  an  event  a  that  never  gets  activated  after  some  point  in  time, 
then,  as  we  will  see,  it  is  possible  that  the  construction  of  some  sample  path  will  remain 
suspended  forever  waiting  for  a  to  happen.  Note  that  a  DES  with  an  irreducible  state  space 
immediately  satisfies  this  condition. 

Assumption  A2  states  that  changing  a  parameter  from  9q  to  some  6m  #  Vo  does  not  alter 
the  event  set  £.  More  importantly,  A2  guarantees  that  changing  to  &m  does  not  introduce 
any  new  events  so  that  all  event  lifetimes  for  all  events  can  be  observed  from  the  nominal 

sample  path.  Q 

Finally  assumption  A3  guarantees  that  changing  a  parameter  from  60  to  some  Vm  +  v0 
does  not  affect  the  distribution  of  one  or  more  event  lifetime  sequences.  This  allows  us  to 
use  exactly  the  same  lifetimes  that  we  observe  in  the  nominal  sample  path  to  construct  the 
perturbed  sample  path.  In  other  words,  our  analysis  focuses  on  structural  system  parameters 
rather  that  distributional  parameters.  As  we  will  see,  however,  it  is  straightforward  to  handle 
the  latter  at  the  expense  of  some  computational  cost. 


3.  Coupled  Sample  Path  Construction 

Before  presenting  the  coupling  approach  we  use  to  solve  the  constructability  problem  and 
the  explicit  procedure  we  will  refer  to  as  the  Time  Warping  Algorithm ,  let  us  consider  the 
following  motivating  example  in  order  to  illustrate  the  main  aspects  of  this  procedure. 

InaG/G/l/K  queueing  system,  the  event  set  is  £  =  [a ,  d)  (a  for  arrival,  d  for  departure), 
and  the  state  x  is  a  non-negative  integer  with  x<K  representing  the  number  of  customers 
currently  in  the  system.  Let  the  observed  sample  path  be  one  with  queue  capacity  K  -  2, 
and  let  us  try  to  construct  a  sample  path  under  K  =  3  in  the  framework  of  Figure  1.  Let 
r(x[K])  be  the  feasible  event  set  at  state  x  for  the  system  under  K  and  assume  that  both 
systems  are  initially  empty.  Unlike  earlier  sample  path  constructability  techniques,  i.e., 
the  SC  and  ASA  methods,  we  can  no  longer  maintain  between  the  two  sample  paths  (the 
observed  one  and  the  one  to  be  constructed)  a  coupling  which  preserves  full  synchronization 
of  events-  this  is  because  of  the  absence  of  Markovian  event  processes  which  allow  us  to 
exploit  the  memoryless  property.  Thus,  we  must  maintain  each  feasible  event  set,  T(x[2]) 
and  r(x[3]),  separately  for  each  observed  state  x[2]  and  constructed  state  x[3].  Whenever 
an  event  is  observed,  its  lifetime  is  assumed  to  become  available  (i.e.,  the  time  when  this 
event  was  activated  is  known).  Each  such  lifetime  is  subsequently  used  in  the  construction 
of  the  sample  path  under  K  =  3. 

To  see  precisely  how  this  can  be  done,  we  start  out  with  a  state  x  [3]  =  0  for  the  constructed 
sample  path,  so  that  r(x[3])  =  {a}.  Since  no  event  lifetimes  are  initially  available,  we 


consider  the  sample  path  of  this  system  as  “suspended”.  The  initial  state  of  the  observed 
sample  path  is  *[2]  =  0  so  that  r(jc[2])  =  {a}.  Therefore,  the  first  observed  event  is  a.  At 
this  point,  the  constructed  sample  path  may  be  “resumed”,  since  all  lifetimes  of  the  events  in 
f(jc[3])  are  now  available,  namely  the  lifetime  of  a.  The  constructed  sample  path  advances 
time,  and  updates  its  state  to  x[3]  =  1.  Now  T(x[3])  =  {a,  d)  but  neither  event  has  been 
observed  yet,  so  the  constructed  sample  path  is  suspended  again  until  at  least  one  a  and  one 
d  events  occur  at  the  observed  sample  path.  This  start/stop  (or  suspend/resume)  process 
goes  on  until  a  sample  path  under  K  =  3  is  constructed  up  to  a  desired  number  of  events 
or  some  specified  time.  Note  that  the  coupling  between  the  two  sample  paths  requires  a 
process  through  which,  at  every  observed  event,  one  must  determine  whether  it  is  possible 
to  resume  construction  of  the  suspended  path.  It  may  take  several  observed  events  before 
this  is  possible.  However,  it  is  also  possible  that  such  an  event  triggers  a  series  of  events 
on  the  constructed  sample  path,  and  hence  a  sequence  of  state  transitions  and  time  updates 
(e.g.,  if  T(jc[3])  =  {a,  d],  a  sequence  of  events  [a,  a,  a,  d)  will  cause  three  state  transitions 
in  a  row  as  soon  as  d  is  observed).  The  fact  that  in  this  process  we  move  backward  in  time 
to  revisit  a  suspended  sample  path  and  then  forward  by  one  or  more  event  occurrences  lends 
itself  to  the  term  “time  warping”. 

Despite  the  simplicity  of  the  main  concept,  the  formal  description  of  the  coupling  scheme 
and  the  process  through  which  the  start/stop  condition  is  determined  are  somewhat  tedious 
largely  due  to  the  notation  necessary.  In  what  follows,  we  introduce  some  basic  definitions 
and  notation  before  deriving  the  coupling  process  dynamics  and  describing  the  exact  sample 
path  construction  procedure. 


3.1.  Notation  and  Definitions 

First,  let  £  (n ,  6 )  =  [ej :  j  =  1 . n },  with  ej  €  £,  be  the  sequence  of  events  that  constitute 

the  observed  sample  path  up  to  n  total  events.  Although  £(n,  8)  is  clearly  a  function  of  the 
parameter  8,  we  will  write  £(n)  to  refer  to  the  observed  sample  path  and  adopt  the  notation 
%(k)  =  {ej:  j  =  1, ....  k)  for  any  constructed  sample  path  under  a  different  value  of  the 
parameter  up  to  k  events  in  that  path.  It  is  important  to  realize  that  k  is  actually  a  function 
of  n,  since  the  constructed  sample  path  is  coupled  to  the  observed  sample  path  through 
the  observed  event  lifetimes;  however,  again  for  the  sake  of  notational  simplicity,  we  will 
refrain  from  continuously  indicating  this  dependence. 

Next  we  define  the  score  of  an  event  i  €  £  in  a  sequence  £(n),  denoted  by  s"  —  [£  («)], ,  to 
be  the  non-negative  integer  that  counts  the  number  of  instances  of  event  i  in  this  sequence. 
The  corresponding  score  of  i  in  a  constructed  sample  path  is  denoted  by  s-  =  [£ (k)],.  In 
what  follows,  all  quantities  with  the  symbol  “  ?  ”  refer  to  a  typical  constructed  sample  path. 

Associated  with  every  event  type  i  e  £  in  $(n)  is  a  sequence  of  s"  event  lifetimes 

V,-(n)  =  (u,(l) . i», •(*")}  for  all  i  €  £ 

The  corresponding  set  of  sequences  in  the  constructed  sample  path  is: 

V, (*)-(»,(!) . for  all  f  €  £ 
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which  is  a  subsequence  of  V,  (n)  with  Jfc  <  n.  In  addition,  we  define  the  following  sequence 
of  lifetimes: 

V,(n,  Jfc)  =  +  1) . Hr(*f)}  for  all  i  e  5 

which  consists  of  all  event  lifetimes  that  are  in  V,  (n)  but  not  in  V,(&).  Associated  with  any 
one  ofthese  sequences  are  the  following  operations.  Given  some  W,-  =  [Wi(j) . ui,(r)}. 

Suffix  Addition:  W,  +  {i o,-(r  +  1)}  =  {u»«0'),  •  •  • .  w(r).  u>,-(r  +  1)}  and, 

Prefix  Subtraction:  W,  -  {u;,0)}  =  {u»,0‘  +  1), ....  u»,(r)}. 

Note  that  the  addition  and  subtraction  operations  are  defined  so  that  a  new  element  is  always 
added  as  the  last  element  (the  suffix)  of  a  sequence,  whereas  subtraction  always  removes 
the  first  element  (the  prefix)  of  the  sequence. 

Next,  define  the  set 

A{n,k)  =  {/:  i  €  £,s"  >  5,*}  (1) 

which  is  associated  with  V,  (n ,  k)  and  consists  of  all  events  i  whose  corresponding  sequence 
V,  (n,  k)  contains  at  least  one  element.  Thus,  every  i  €  A(n,  k)  is  an  event  that  has  been 
observed  in  £(n)  and  has  at  least  one  lifetime  that  has  yet  to  be  used  in  the  coupled  sample 
path  |  (Jfc).  Hence,  A(n,  Jfc)  should  be  thought  of  as  the  set  of  available  events  to  be  used  in 

the  construction  of  the  coupled  path. 

Finally,  we  define  the  following  set,  which  is  crucial  in  our  approach: 

M(n,  k)  —  T(.?*)  -  (r(x*_i)  -  {e*})  @) 

where,  clearly,  M(n,k)  C  £.  Note  that  ek  is  the  triggering  event  at  the  (k  -  l)th  state 
visited  in  the  constructed  sample  path.  Thus,  M(n,  k)  contains  all  the  events  that  are  in 
the  feasible  event  set  r(x*)  but  not  in  r(x*_i);  in  addition,  ek  also  belongs  to  M(n,  k)  if 
it  happens  that  ek  e  T(i*).  Intuitively,  Af(n,  k)  consists  of  all  missing  events  from  the 
perspective  of  the  constructed  sample  path  when  it  enters  a  new  state  xk:  those  events 
already  in  T(x*_i)  which  were  not  the  triggering  event  remain  available  to  be  used  in  the 
sample  path  construction  as  long  as  they  are  still  feasible;  all  other  events  in  the  set  are 
“missing”  as  far  as  residual  lifetime  information  is  concerned. 

The  concurrent  sample  path  construction  process  we  are  interested  in  consists  of  two 
coupled  processes,  each  generated  by  a  timed  state  automaton.  This  implies  that  there  are 
two  similar  sets  of  equations  that  describe  the  dynamics  of  each  process.  In  addition,  we 
need  a  set  of  equations  that  captures  the  coupling  between  them. 

3.2.  Timed  State  Automaton  Dynamics 

We  briefly  review  here  the  standard  timed  state  automaton  dynamics,  also  known  as  a 
Generalized  Semi-Markov  Scheme  (GSMS)  (see  Cassandras,  1993;  Glasserman,  1991;  Ho 
and  Cao,  1991).  We  introduce  two  additional  variables,  tn  to  be  the  time  when  the  nth  event 
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occurs,  and  y,(n),  i  G  r(x„),  to  be  the  residual  lifetime  of  event  i  after  the  occurrence  of 
the  nth  event  (i.e.,  it  is  the  time  left  until  event  i  occurs).  On  a  particular  sample  path, 
just  after  the  nth  event  occurs  the  following  information  is  known:  the  state  x„  from  which 
we  can  determine  r(x„),  the  time  t„,  the  residual  lifetimes  y,(n)  for  all  i  e  T(xn),  and  all 
event  scores  s",  i  e  £.  The  following  equations  describe  the  dynamics  of  the  timed  state 
automaton. 

Step  1:  Determine  the  smallest  residual  lifetime  among  all  feasible  events  at  state  jc„, 
denoted  by  y*: 

y*n  =  min  {y,(n)}  (3). 

leru.) 

Step  2:  Determine  the  triggering  event: 

en+,  =  arg  min  (y,(n)}  (4) 

iertx.) 


Step  3:  Determine  the  next  state: 


Xn+ 1  =  f(Xn,en+ 1) 


(5) 


Step  4:  Determine  the  next  event  time: 

W 1  =  tn  +  y« 


(6) 


Step  5:  Determine  the  new  residual  lifetimes  for  all  new  feasible  events  i  6  r(xn+,): 


Vi  (»+D  = 


y.(")  -  y„ 

v,(s?  + 1) 


if  i  #  e„+i  and  i  e  r(x,,) 
if  i  =  e„+i  or  i  $  T(xn) 


for  all  i  G  f(xn+i) 


(7) 


Step  6:  Update  the  event  scores: 

-«+i  _  {  s?  +  1  if  /  =  en+, 

'  |  s ”  otherwise 

Equations  (3M8)  describe  the  sample  path  evolution  of  a  timed  state  automaton.  These 
equations  apply  to  both  the  observed  and  the  constructed  sample  paths.  Next,  we  need 
to  specify  the  mechanism  through  which  these  two  sample  paths  are  coupled  in  a  way 
that  enables  event  lifetimes  from  the  observed  f  (n)  to  be  used  to  construct  a  sample  path 
f  (&).  First,  observe  that  the  process  described  by  (3M8),  applied  to  |(/:),  hinges  on  the 
availability  of  residual  lifetimes  y,  (/c)  for  all  i  g  T(Jc*).  Thus,  the  constructed  sample  path 
can  only  be  “active"  at  state  x*  if  every  i  G  r(x*)  is  such  that  either  i  e  (r(x*_i)  -  {ek}) 
(in  which  case  y;(&)  is  a  residual  lifetime  of  an  event  available  from  the  previous  state 
transition)  or  i  G  A(n,  k)  (in  which  case  a  full  lifetime  of  i  is  available  from  the  observed 
sample  path).  This  motivates  the  following: 
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Definition  1.  A  constructed  sample  path  is  active  at  state  xk  after  the  occurrence  of  an 
observed  event  en  if,  for  every  i  €  i  €  (r(i*_i)  —  {«*))  U  A(n,  k). 

Thus,  the  start/stop  conditions  for  the  construction  of  a  sample  path  are  determined  by 
whether  it  is  active  at  the  current  state  or  not. 


3.3.  Coupling  Dynamics 


Upon  occurrence  of  the  (n  +  l)th  observed  event,  e„+i,  the  first  step  is  to  update  the  event 
lifetime  sequences  V,  (n,  k)  as  follows: 


V, (n,  k)  +  u; (sf)  if  i  =  e„+i 
.  V,(n,  k) 


otherwise 


(9) 


The  addition  of  a  new  event  lifetime  implies  that  the  “available  event  set”  A(n,k)  defined 
in  (1)  may  be  affected.  Therefore,  it  is  updated  as  follows: 


A{n  +  l,k)  =  A(n,  k )  U  { e„+i } 


(10) 


Finally,  note  that  the  “missing  event  set”  M (n,  k)  defined  in  (2)  remains  unaffected  by  the 
occurrence  of  observed  events: 


M (n  +  l ,  k)  =  M (n,  k) 

At  this  point,  we  are  able  to  decide  whether  all  lifetime  information  to  proceed  with  a  state 
transition  in  the  constructed  sample  path  is  available  or  not.  In  particular,  the  condition 


M(n  +  l,k)QMn  +  l,k) 


(12) 


may  be  used  to  determine  whether  the  constructed  sample  path  is  active  at  the  current  state 
Xh  (in  the  sense  of  Definition  1).  The  following  is  a  formal  statement  of  this  fact. 

LEMMA  1  A  constructed  sample  path  is  active  at  state  xk  after  an  observed  event  e„+i  if 
and  only  if  M(n  +  1,  k)  C  A(n  +  1,  k). 

Proof:  Let  Bk  =  rfo-i)  -  {h}-  Suppose  M(n  +  1,  k)  C  A{n  +  1,*)  and  let «  €  T(^). 
Then,  consider  the  following  two  cases:  (i)  If  i  €  Bk,  then,  by  definition,  the  sample 
path  is  active  at  xk.  (U)  If  i  $  Bk,  by  the  definition  of  M(n,k)  in  (2),  it  follows  that 
i  6  M{n,k)  =  M(n  +  l,k)  and  since  M{n  +  1,  k)  c  A(n  +  \,k),wc  have!  e  A(n  +  \.k) 
which  implies  again  that  the  sample  path  is  active  at  state  xk. 

Conversely,  suppose  that  the  sample  path  is  active  at  state  xk.  Let  i  €  M(n  +  1,  k). 
Then,  from  (2),  i  e  T(xk),  but  i  $  Bk.  However,  since  the  sample  path  is  active,  we 
must  have  /  e  A{n  +  1,1k)  by  Definition  1.  Therefore,  M(n  +  1,^)  9  A(n  +  \,k). 


Assuming  (12)  is  satisfied,  equations  (3H8)  may  be  used  to  update  the  state  xk  of  the 
constructed  sample  path.  In  so  doing,  lifetimes  v,  (sf  +  1)  for  all  /  e  M(n  +  1,  k)  are  used 
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from  the  corresponding  sequences  V,  («  +  1  ,k).  Thus,  upon  completion  of  the  six  state 
update  steps,  all  three  variables  associated  with  the  coupling  process,  i.e.,  V,  (n,  k),  A(n,  k), 
and  M(n,k)  need  to  be  updated.  In  particular. 


V/(n  +  1,  k)  -  Vj(sf  +  1)  for  all  i  e  M(n  +  1  ,k) 
\i(n  +  l,k)  otherwise 


(13) 


This  operation  immediately  affects  the  set  A(n  +  1,  k)  which  is  updated  as  follows: 

A(n  +  1, +  1)  =  A(n  +  l,fc)  —  {i:  ieM(n  +  l,k),  r*+1  =  rf+l }  (14) 

Finally,  applying  (2)  to  the  new  state 

M(n  +  1,  k  +  1)  =  r(x*+1)  -  { r(xk )  -  {**+,})  (15) 

Therefore,  we  are  again  in  a  position  to  check  condition  (12)  for  the  new  sets  M  (n  + 1 ,  k  + 1 ) 
and  A(n  +  1,  k  +  1).  If  it  is  satisfied,  then  we  can  proceed  with  one  more  state  update  on 
the  constructed  sample  path;  otherwise,  we  wait  for  the  next  event  on  the  observed  sample 
path  until  (12)  is  again  satisfied.  Similar  to  Lemma  1,  we  have: 

Lemma  2  A  constructed  sample  path  is  active  at  state  xk+]  after  event  ek+\  if  and  only  if 
M(n  +  l,k  +  l)  c  A(n  +  l,k  +  l). 

Proof:  This  is  similar  to  the  proof  of  Lemma  1.  Let  Bt+i  =  r(Jc*)  —  {e*+i}-  Suppose 
M{n  +  1,  k  +  1)  C  A(n  +  1,  k  +  1)  and  let  i  e  F(£*+1).  Then,  consider  the  following 
two  cases:  (i)  If  i  e  Bk+U  then,  by  definition,  the  sample  path  is  active  at  state  jc*+].  (ii)  If 
i  £  Bk+ 1,  by  (15),  it  follows  that  i  e  M(n  +  l,k  +  1)  and  since  M (n  +1,^  +  1)  c 
A(n  +  1,  k  +  1),  we  have  i  €  A(n  +  1,  k  +  1)  which  implies  again  that  the  sample  path  is 
active  at  state  xk+\. 

Conversely,  suppose  that  xk+\  is  active.  Let  i  €  M(n  +  1,&  +  1).  Then,  from  (15), 
i  e  r(**+i),  but  i  g  Bk+\.  However,  since  the  sample  path  is  active,  we  must  have 
i  e  A(n  +  1,  k  +  1)  by  Definition  1.  Therefore,  M(n  +  1,  k  +  1)  C  A(n  +  1,  k  +  1). 


The  analysis  above  is  summarized  below  in  the  form  of  the  following  Time  Warping 
Algorithm  (TWA). 


Time  Warping  Algorithm  (TWA): 

1.  INITIALIZE 

n  :=  0,  k  :=  0,  t„  :=  0,  f*  :=  0,  xn  :=  x<>,  =  ■*(>. 

yt(n)  =  Vj(l)  for  all  i  e  r(x„),  s"  =  0,5*  =  0  for  all  i  e  £, 

A/(0, 0)  :=  r(x0),  A( 0, 0)  :=  0 

2.  WHEN  EVENT  en  IS  OBSERVED: 

2.1  UseOHSltodetermineefl+i.Xfl+i.tn+i.y/fn  +  llforalli  €  r(xn+i), s"+l  for 
all  i  e  £. 
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2.2 


2.3 


2.4 


2.5 


Add  the  e„+i  event  lifetime  to  V,  (n  +  1,  k): 


V,-(n  +  l.t) 


=  *,<»». 
V,(n. 


k)  +  vt(s")  if  i  =  en+i 
k)  otherwise 


Update  the  available  event  set  A(n,  k):  A(n  +  1,  k)  =  A(«,  k)  U  {e„+\ } 

Update  the  missing  event  set  M(n,  k):  M(n  +  1,  k)  =  M(n,k) 

IF  M(n  + 1  ,  k)  c  A(n  +  1,  k )  then  Goto  3.  ELSE  setn  «-  n  +  1  and  Goto  2.1. 


3.  TIME  WARPING  OPERATION: 

3.1  Obtain  all  missing  event  lifetimes  to  resume  sample  path  construction  at  state 
xk- 


9iik)  = 


Vi  (S*  +  1)  for  i  6  M(n  +  1,  k) 
yi(k  -  1)  otherwise 


3.2  Use  (3)-(8)  to  determine  ei+i,it+i,tt+i,yi(^  +  l)  for  alii  e  r(i*+i)n(r(i*) 
{e*+i}),  if+l  for  alii  6  S. 

3.3  Discard  all  used  event  lifetimes: 

V,(n  +  1,  k  +  1)  =  V,(«  +  1,  k)  -  v.Cs*  +  1)  for  alii  6  M(n  +  l,k) 

3.4  Update  the  available  event  set  A(n  +  1,  k): 

A(n  +  l,k  +  l)  =  A(n  +  l,k)-{i:  i  eM(n  +  U),  5*+I  =^"+1} 

3.5  Update  the  missing  event  set  M{n  +  1 ,  k): 

M(n  +  1,  k  +  1)  =  r(i*+,)  -  (r(i*)  -  {e*+i}) 

3.6  IF  M(n  +  1,  k  +  1)  c  A(n  +  1,  k  +  1)  then  k  <-  k  +  1  and  Goto  3.1.  ELSE 
k  <—  k  +  \,n  4—  n  +  1  and  Goto  2.1. 

Remark.  If  TWA  is  used  on  line,  step  2.1  is  taken  care  of  automatically  by  the  actual  system. 
The  additional  operations  involved  are  steps  2.2-2.4  and  checking  the  condition  in  step 
2.5.  If  the  latter  is  satisfied  then  the  time  warping  operation  in  step  3  is  responsible  for 
constructing  a  segment  of  the  desired  sample  path  for  as  many  events  as  possible,  depending 
on  step  3.6. 

Clearly,  the  computational  requirements  of  TWA  are  minimal  (adding  and  subtracting 
elements  to  sequences,  simple  arithmetic,  and  checking  condition  (12)).  Rather,  it  is  the 
storage  of  additional  information  that  constitutes  the  major  cost  of  the  algorithm.  Next  we 
analyze  the  algorithm  in  terms  of  its  computational  efficiency  and  applicability.  Further¬ 
more,  in  section  5  we  describe  extensions  to  TWA  allowing  us  to  relax  the  assumptions 
presented  in  section  2. 
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4.  Algorithm  Evaluation 


The  evaluation  of  the  TWA,  in  terms  of  its  computational  efficiency  as  an  estimation  tech¬ 
nique,  depends  on  whether  it  is  used  on  line  or  off  line.  In  the  former  case,  data  from  a 
DES  are  directly  observed  and  processed  to  concurrently  construct  sample  paths  over  a  set 
of  different  parameter  settings.  Aside  from  the  storage  requirement  for  event  lifetimes,  the 
computational  cost  of  the  algorithm  is  minimal.  Thus,  while  for  DES  with  exponential  event 
lifetime  distributions  the  ASA  and  SC  methodologies  discussed  in  section  1  are  clearly  more 
efficient,  the  ability  to  handle  non-Markovian  DES  on  line  comes  with  minimal  additional 
cost. 

In  the  off-line  case,  the  data  processed  by  the  TWA  are  obtained  from  a  simulation  model  of 
a  DES.  Once  again,  if  the  model  is  characterized  by  exponential  event  lifetime  distributions, 
then  the  ASA  and  SC  methodologies  are  much  more  efficient  (see  section  6.3).  Let  us 
therefore  concentrate  on  a  comparison  of  the  TWA  with  repetitive  simulation  for  each 
parameter  value  (i.e.,  “brute-force”  simulation),  which  is  the  obvious  alternative  for  non- 
Markovian  systems.  In  this  case,  a  basic  first  question  is  whether  the  TWA  is  indeed  more 
efficient  than  brute-force  simulation.  Thus,  if  we  let  Tbf  be  the  total  simulation  time 
(in  CPU  time  units)  required  to  generate  the  sample  paths  £(0i), . . . ,  through  M 
individual  simulations  (brute  force  simulation)  and  Tcs  be  the  time  required  to  generate  the 
same  sample  paths  through  concurrent  simulation,  then  the  natural  requirement  is  that 


Tcs  <  Tbf  (16) 

Observe  that  the  operations  in  step  3.2  of  TWA  are  the  same  as  those  in  step  2.1;  these  are 
required  to  update  the  state  of  the  DES  as  described  through  (3H8)  and  must  be  also  carried 
out  when  brute-force  simulation  is  used.  Therefore,  this  part  is  common  to  both  brute-force 
simulation  and  TWA.  The  advantage  of  TWA  as  far  as  execution  time  is  concerned  is  that 
no  random  variate  generation  is  involved  in  any  of  the  concurrently  constructed  sample 
paths.  On  the  other  hand,  TWA  introduces  some  overhead  when  writing  to  and  reading 
from  memory  and  when  checking  the  subset  condition  in  step  2.5.  As  long  as  this  overhead 
is  less  than  the  time  taken  to  generate  the  random  variates  of  the  constructed  sample  path, 
(16)  holds.  In  the  next  section,  we  proceed  with  a  quantitative  comparison  between  the 
TWA  and  brute-force  simulation  and  define  a  convenient  measure  in  order  to  accomplish 
this  task. 


4.1.  The  Speedup  Factor 

To  define  the  speedup  factor  associated  with  concurrent  simulation,  suppose  that  the  sam¬ 
ple  path  constructed  through  our  coupling  approach  were  instead  generated  by  a  separate 
simulation  whose  length  is  defined  by  N  total  events.  Let  TN  be  the  time  it  takes  (in  CPU 
time  units)  to  complete  such  a  simulation  run.  Further,  suppose  that  when  the  nominal 
simulation  is  executed  with  TWA  as  part  of  it,  the  total  time  is  given  by  Tf,  +  rK,  where 
T°n  is  the  simulation  time  without  the  TWA  and  xK  is  the  additional  time  involved  in  the 
concurrent  construction  of  a  sample  path  with  K  <N  events.  We  then  define  the  speedup 
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factor  due  to  TWA  as 

_  Tn/N  (17) 

T  k/K 

Thus,  if  a  separate  simulation  (in  addition  to  the  one  for  the  observed  sample  path)  were 
to  be  used  to  generate  a  sample  path  under  a  new  value  of  the  parameter  of  interest,  the 
computation  time  per  event  is  Tn/N.  If,  instead,  we  use  the  coupling  approach  (i.e.,  TWA) 
in  conjunction  with  the  observed  path,  no  such  separate  simulation  is  necessary,  but  the 
additional  time  per  event  imposed  by  the  approach  is  r k/K,  where  K  <  N  in  general. 
Clearly,  5  >  1  is  required  to  satisfy  the  basic  requirement  expressed  in  (16). 


4.2.  An  Upper  Bound  for  Speedup 

To  determine  an  upper  bound  for  the  speedup  factor  S ,  first,  note  that  the  actual  CPU  time  for 
a  simulation  consists  of  two  components:  (a)  The  time  used  to  generate  random  variates  for 
all  distributions  involved,  and  (b)  The  time  used  in  all  other  simulation  execution  operations 
(updating  the  state,  clock,  and  event  list).  Therefore  we  can  write: 

Tn  =  aTN  +  (1  -  a)TN  (18> 

where  TN  is  the  total  CPU  time  to  simulate  N  events  in  the  nominal  sample  path,  and  a  is 
the  fraction  of  time  used  for  generating  random  numbers  and  variates  (0  <  a  <  1).  We 
assume  a  to  be  independent  of  N  at  steady  state,  which  is  true  for  sufficiently  large  values 
of  N. 

Now  suppose  that  using  the  information  obtained  from  the  nominal  sample  path  we 
construct  a  new  sample  path  (under  a  different  parameter  setting)  with  K  <  N  events  using 
the  TWA.  From  the  TWA  it  is  clear  that  the  CPU  time  to  update  the  state,  clock  and  event  list 
in  the  nominal  and  perturbed  sample  paths  are  the  same  since  steps  2.1  and  3.2  are  identical. 
When  simulating  K  events,  this  time  is  (1  -  a)TK .  Furthermore,  the  overhead  involved 
in  implementing  TWA  consists  of  two  parts:  (a)  The  time  taken  to  write  to  and  read  from 
memory,  denoted  by  rKy  which  depends  on  the  number  of  random  variates  observed  in  the 
nominal  sample  path  and  used  in  the  constructed  sample  path,  and  (b)  The  time  taken  to 
check  the  subset  condition  in  step  2.5,  denoted  by  qx-  Note  that  the  latter  check  takes  place 
with  every  observed  or  constructed  event.  It  follows  that  qx  is  independent  of  a.  The  actual 
value  of  qx  depends  on  the  cardinality  of  the  set  M  (n,  k),  denoted  by  \M(n,  k)|.  Therefore, 
since  the  TWA  does  not  involve  any  random  number  generation,  the  total  CPU  time,  xK, 
required  to  construct  a  sample  path  with  K  events  is  obtained  by  combining  the  second  part 
of  (18)  with  the  read/write  overhead  component  above: 

Ur  =  (1  —<x)Tk  +r/c  +qK 

Another  way  of  viewing  the  total  time  xK  is  based  on  the  following  observation.  When  the 
TWA  is  executed  along  the  nominal  sample  path,  certain  functions  are  performed  whenever 
an  event  is  observed  (i.e.,  saving  the  event  lifetime  and  checking  (12));  therefore  the  total 
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time  involved  in  such  functions  is  independent  of  K  and  will  be  denoted  by  a.  The  remaining 
functions  are  performed  whenever  a  constructed  event  is  processed  and  the  corresponding 
time  is  denoted  by  b.  We  can  therefore  write 

zK  =  a  +  bK  (20) 


and  it  follows  that  r k/K  =  b  +  a/K  is  a  decreasing  function  of  K.  Moreover,  an  upper 
bound  for  K  is  N,  since  the  constructed  sample  path  depends  on  the  lifetimes  observed  in 
the  nominal  sample  path;  hence,  r n/N  <  r k/K.  We  are  now  in  a  position  to  obtain  an 
upper  bound  for  the  speedup  factor  as  follows: 


Tn/N  <  Tn  _  Tn  ^  Tn 

xk/K  ~  tn  (1  —  ot)Ts  +ru  +qn  ~  (1  -  a)Tn  +  r n 


(21) 


As  far  as  the  last  inequality  is  concerned,  we  have  chosen  to  ignore  qn  since,  as  will  be 
shown  in  section  4.3,  q n  is  small  when  \M(n,  fc)|  is  small.  Finally,  define 


P  =  (22) 

Tn 

and  substitute  in  (21)  to  get 

S  < - 1?- - = - l- -  (23) 

~  (1  -a)TN+pTN  l  +  P  —  a 

This  implies  that  the  bound  on  the  maximum  achievable  speedup  is  a  function  of  the 
structural  properties  of  the  actual  DES  we  simulate  and  the  simulator  used,  manifested 
through  a,  and  the  specific  processing  environment  used  which  determines  p.  Note  also 
that  the  speedup  as  described  in  (21)  is  proportional  to  the  ratio  K/N,  but  for  the  calculation 
of  the  bound  we  assumed  that  this  ratio  takes  its  maximum  value,  i.t.K/N  =  1.  This  implies 
that  the  larger  the  value  of  the  ratio,  the  tighter  the  bound.  On  the  other  hand,  note  that  it  is 
possible  that  for  certain  systems  this  is  not  the  case  which  implies  that  the  observed  speedup 
may  be  considerably  less  than  the  upper  bound.  In  the  worse  case  scenario,  K  =  0,  which 
means  that  the  speedup  will  also  be  zero. 

Figure  2  shows  some  typical  plots  of  the  speedup  upper  bound  as  a  function  of  a  for 
various  values  of  p.  One  can  see  that  concurrent  simulation  is  most  beneficial  when  used  to 
construct  sample  paths  for  DES  that  require  a  large  number  of  random  variates  of  complex 
distributions.  More  specifically,  concurrent  simulation  is  desirable  when  p  <  a  where  it 
is  feasible  to  expect  speedup  factors  greater  than  1,  i.e.,  concurrent  simulation  reduces  the 
total  simulation  time.  Note  however,  that  P  <  a  does  not  guarantee  a  speedup  greater 
than  1  since  the  actual  speedup  includes  q k  which  has  been  ignored  in  the  derivation  of 
the  bound  (see  (21)).  Thus,  the  smaller  qK  is,  the  tighter  the  bound.  This  motivates  the 
discussion  on  regular  DES  presented  next. 


4.3.  Regular  Discrete  Event  Systems 

As  discussed  in  the  previous  section,  systems  with  a  small  cardinality  | M ( n ,  k)\  for  the  set 
M(n,k)  attain  higher  speedup  factors.  Intuitively,  systems  with  a  large  \M(n,k)\  imply 
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Figure  2.  Speedup  as  a  function  of  a. 

that  the  simulator  will  spend  a  longer  time  updating  the  event  list,  which  in  turn  implies  that 
a  is  reduced,  hence  the  speedup  is  also  reduced  as  suggested  by  Figure  2.  Thus,  the  role  of 
| M{n,  fc)|  is  important  in  our  approach.  The  following  result  shows  that  \M(n,  &)|  can  be 
bounded  by  a  quantity  a  which  depends  entirely  on  the  structural  properties  of  a  DES. 

LEMMA  3  Let  m(n,  k)  -  \M(n,k)\  be  the  cardinality  of  the  set  M  (n,  k).  Then, 
m{n,  k)  <  a  +  1 

where  a  is  a  non-negative  integer  such  that: 

<7=max{<7*}  k  =  1,2,... 
k 


and  0^  =  |r(ijt)  —  r(x*_i)|  ' 

Proof:  If  ek  $  T(^),  then,  by  the  definition  of  M (n,  k)  in  (2),  it  is  clear  that  M(n,  k)  = 
p(^)  _  r(Jc*_i),  therefore  |M(n,  k)\  =  ctk.  If,  on  the  other  hand,  ek  e  T (x* ) ,  then  ek  is 
an  additional  element  that  belongs  to  M(n,  k ),  i.e.,  Af(n,  k)  =  (TU*)  -  r(x*_i))  U  (et), 
hence  \M(n,  Jk)|  =  ok  +  1.  Therefore,  the  result  follows.  ■ 

In  simple  terms,  ok  is  the  cardinality  of  a  set  that  contains  all  events  that  are  feasible  at 
state  xk  but  which  were  not  feasible  in  the  previous  state  £*-i.  It  is  important  to  observe 
that  o  indeed  depends  entirely  on  the  structural  properties  of  a  DES  and  is  therefore  not 
difficult  to  evaluate. 
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Clearly,  the  minimal  value  of  a  is  a  =  0.  In  this  case,  whenever  an  event  occurs  at  any 
state,  no  new  events  are  ever  triggered.  It  is  easy  to  see  that  this  corresponds  to  a  DES 
with  T(jc)  =  £  for  all  states  x,  i.e.,  a  system  where  all  events  are  always  feasible.  Such 
DES  are  often  of  interest;  however,  they  then  trivially  satisfy  the  constructability  condition 
(CO),  therefore  the  SC  or  ASA  techniques  are  preferable  (see  also  Cassandras  and  Pan, 
1995). 

The  next  most  desirable  class  of  DES  (in  the  sense  of  maximizing  the  speedup  factor)  is 
that  characterized  by  a  =  1.  This  motivates  the  following 

Definition.  A  DES  is  said  to  be  regular  if 

o  =max :{|IV)  -  TU)|  :  x'  =  f(x,e),ee  T(x)}  <  1  (27) 

We  refer  to  this  as  the  Regularity  Condition  or  condition  (R). 

It  turns  out  that  regularity  is  not  a  particularly  restrictive  condition.  For  example,  we 
show  next  that  a  large  and  particularly  useful  class  of  DES  are  regular. 

PROPOSITION  1  All  DES  represented  through  open  or  closed  Jackson-like  queueing  net¬ 
works  satisfy  the  regularity  condition. 

Proof:  Consider  a  Jackson-like  queueing  network  with  n  nodes  and  let  x  =  Ul . Xn) 

be  a  typical  state,  where  x,  is  the  number  of  customers  in  the  ith  node.  Let  T(x)  be  the 
feasible  event  set  at  that  state.  Every  event  in  T(x)  is  either  an  external  arrival  at  some 
node  k,  denoted  by  ay  (if  the  system  is  open)  or  a  departure  from  some  node  i,  denoted  by 
di.  If  there  are  multiple  customer  classes,  each  such  event  is  further  characterized  by  its 
corresponding  class;  however,  as  will  be  clear,  the  argument  that  follows  is  independent  of 
the  number  of  customer  classes  present.  Denote  the  current  state  by  x  and  the  next  state  by 
x'  =  f(x,  e)  such  that  e  is  either  some  arrival  a*  or  some  departure  dk. 

If  ay  is  the  next  event  that  occurs  at  state  x  then  the  following  two  cases  are  possible: 

(i)  if  xy  =  0,  then  r(x')  =  T(x)  U  {d*}  and  therefore  |r(x')  -  r(x)|  =  1 

(ii)  if  xy  >  1  then  F(x')  =  T(x)  and  therefore  ITU')  -  T(x)|  =  0. 

If,  on  the  other  hand,  the  next  event  is  some  d,-,  then  again  two  cases  are  possible: 

First,  if  the  departing  customer  leaves  the  system  (if  it  is  an  open  system),  then,  r(x')  = 
T(x)  -  {d,}  if  x'j  =  0  and  T(x')  =  T(;t)  if  x\  >  0.  Thus,  |r(x')  -  T(x)|  =  0. 

Second,  if  the  departing  customer  is  routed  to  a  node  j,  then,  there  are  the  following 
subcases: 

(i)  x[  =  0,  x'j  =  1.  Then,  T(x')  =  (T(x)  -  {d,})  U  [dj],  and  |T(x')  -  T(x)|  =  1. 


(ii)  x’  =  0,  x'j  >  1.  Then,  T(x')  =  T(x)  -  (d,),  and  IfV)  -  T(x)|  =  0. 
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(Hi)  x'  >  0,  x]  =  1.  Then,  V)  =  IV)  U  [dj),  and  |IV)  -  T(x)|  =  1. 

(iv)  x'  >  0,  xj  >  1.  Then,  IV)  =  IV),  and  |IV)  -  IV)|  =  0. 

It  follows  that  in  all  cases  |IV)  -  r(x)|  <  1  and  the  proof  is  complete.  ■ 

It  is  actually  possible  to  incorporate  features  such  as  non-FIFO  scheduling  disciplines 
and  preserve  the  regularity  property  of  such  DES.  On  the  other  hand,  allowing  for  finite- 
capacity  queues  and  “manufacturing  blocking”  generally  increases  the  value  of  a  and 
violates  regularity. 

Example.  Consider  zG/G/l/K  system,  with**  €  {0, 1, ....  K )  and  event  set£  =  {a,d). 
lfxk  =  land**-]  =  0  (i.e.,  the  kth  event  was  an  arrival),  then  ak  =  |{a,d}-{a}|  =  |{d}|  = 
1.  For  any  other  state  transition,  xk  =  x,  xk-i  =  x  —  1  (arrival),  or  xk  =  x  —  1,  x*-i  =  x 
(departure),  such  that  x  >  1,  ak  =  ITU*)  —  rv*_i)|  =  0.  Therefore,  a  =  max*{cr*)  =  !• 
Note  that  if  K  =  oo,  we  still  have  a  =  1. 

Example.  In  a  serial  network  of  two  G/G/l/K  queues,  let  xu  x2  denote  the  number  of 
customers  in  queue  or  server  1  and  in  queue  or  server  2  respectively.  Assume  that  queue 
1  has  infinite  capacity  whereas  queue  2  can  only  accommodate  K  -  1  customers.  Then, 
x\  e  {0,  1, . . .},  x2  €  {0, 1, . . . ,  K),  and  the  event  set  is  Z  =  {a,  d\,  d2). 

1 .  Under  “communication  blocking”  (i.e.,  customers  finding  queue  2  full  are  lost): 

■If  (*i.*2)*-i  =  (0,0)and(xi,oc2)t  =  O.0), 0*  =  \{a,dx]  -  {a)|  =  |{di}|  =  1. 

If  (jci,  jc2)*-i  =  (1, 0)  and  (xi.V*  =  (0,  l),<r*  =  |{a,  d2)  -  {a,  d^}\  =  |{cf2}|  =  1- 
If  (jt| , x2)k-i  =  (0,  1)  and  (*i,  jc2)*  =  (1,  l),ok  =  \{a,dud2)-{a,d2)\  =  |Wi)|  =  1. 
For  all  other  state  transitions  ak  =  0.  Therefore,  a  =  1. 

2.  Under  “manufacturing  blocking”  (i.e.,  customers  finding  queue  2  full  must  wait  in 
server  1  for  an  empty  queueing  slot),  we  introduce  a  third  state  variable  b  e  {0,  1 }  such 
that  b  =  1  if  a  customer  at  server  1  is  blocked  and  b  —  0  otherwise. 

If  (jci , x2,  b)k-\  =  (0, 0, 0)andCri,*2,fc)*  =  (1,0,0),ct*  =  |{a,  d\\-[a)\  =  |{<fi)|  = 
1. 

If  (xltx2,b)k-i  =  (1,0,0)  and  (x\,x2,b)k  =  (0,  1,0),  ak  =  \[a,d2)  -  [a,d\)\  = 
|{rf2}|  =  1. 

If  Oi,*2.  (?)*-!  =  (0,  1.0)  and  (*i,.r2,  b)k  =  (1, 1,0),  ct*  =  \{a,d\,d2)  -  [a,d2}\  - 
=  1. 

If  (xltx2,b)k-i  =  (jci,  K,  1)  and  Ui ,x2,b)k  =  (x,  -  1,  /f,0),  ak  =  \[a,dt,d2}  - 
[a,d2)\  =  \[dx)\  =  \. 

For  all  other  state  transitions  ak  =  0.  Again,  therefore,  <7  =  1.  It  is  easy  to  check, 
however,  that  if  a  third  server  were  introduced  in  this  serial  network,  then  we  would 
have  <7  >  1. 
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5.  Extensions  of  the  TWA 


In  section  2  we  stated  a  few  assumptions  that  were  made  to  simplify  the  development  of 
our  approach  and  keep  the  TWA  notationally  simple.  It  turns  out  that  we  can  extend  the 
application  of  TWA  to  DES  by  relaxing  these  assumptions  at  the  expense  of  some  extra 
work. 

In  A2  we  assumed  that  changing  a  parameter  from  &o  to  some  Gm  /  Go  does  not  alter  the 
event  set  £.  Clearly,  if  the  new  event  set,  £m  is  such  that  £m  C  £,  the  development  and 
analysis  of  TWA  is  not  affected.  If,  on  the  other  hand,  £  C  £m,  this  implies  that  events 
required  to  cause  state  transitions  under  6m  are  unavailable  in  the  observed  sample  path, 
which  make  the  application  of  our  algorithm  impossible.  In  this  case,  one  can  introduce 
phantom  event  sources  which  generate  all  the  unavailable  events  as  described,  for  example, 
in  Cassandras  and  Shi  (1996),  provided  that  the  lifetime  distributions  of  these  events  are 
known.  The  idea  of  phantom  sources  can  also  be  applied  to  DES  that  do  not  satisfy  Al. 
In  this  case,  if  a  sample  path  remains  suspended  for  a  long  period  of  time,  then  a  phantom 
source  can  provide  the  required  event(s)  so  that  the  sample  path  construction  can  resume. 

In  A3  we  assumed  that  changing  a  parameter  from  Go  to  some  9m  ^  Go  does  not  affect  the 
distribution  of  one  or  more  event  lifetime  sequences.  This  assumption  is  used  in  (9)  where 
the  observed  lifetime  v,  (s")  is  directly  suffix-added  to  the  sequence  V,  (« -4- 1 ,  k).  Note  that 
this  problem  can  be  overcome  by  transforming  observed  lifetimes  V,  =  (v,(l),  v, (2), . . .) 
with  an  underlying  distribution  G,-  {Go)  into  samples  of  a  similar  sequence  corresponding 
to  the  new  distribution  G,(0m)  and  then  suffix-add  them  in  V,(n  +  l.jfc).  This  is  indeed 
possible,  if  G,  (Go),  G,  (Gm)  are  known,  at  the  expense  of  some  additional  computational  cost 
for  this  transformation  (for  example,  see  Cassandras,  1993).  One  interesting  special  case 
arises  when  the  parameter  of  interest  is  a  scale  parameter  of  some  event  lifetime  distribution 
(e.g.,  it  is  the  mean  of  a  distribution  in  the  Erlang  family).  Then,  simple  rescaling  suffices 
to  transform  an  observed  lifetime  v,  under  Go  into  a  new  lifetime  5,  under  Gm: 

Vi  =  (Gm/60)Vi 

Finally  note  that  in  a  simulation  environment  it  is  possible  to  eliminate  the  overhead  qx 
which  is  due  to  checking  the  subset  condition  in  step  2.5.  In  order  to  achieve  this  we  need 
to  eliminate  the  coupling  between  the  observed  and  constructed  sample  paths.  Towards 
this  goal,  we  can  simulate  the  nominal  sample  but  rather  than  disposing  the  event  lifetimes 
we  save  them  all  in  memory.  Once  the  simulation  is  done,  we  simulate  one  by  one  all  the 
perturbed  sample  paths  exactly  as  we  do  with  the  brute  force  simulation  scheme  but  rather 
than  generating  the  required  random  variates  we  read  them  directly  from  the  computer 
memory.  In  this  way  we  trade  off  computer  memory  for  higher  speedup.  A  quantification 
of  this  tradeoff  is  the  subject  of  ongoing  research. 


6.  Simulation  Results 

Simulation  can  be  used  to  readily  verify  that  the  Time  Warping  Algorithm  generates  sample 
paths  identical  to  those  generated  by  a  separate  simulation  run  with  the  same  input.  In  what 
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Table  1.  Speedup  factors  for  Systems  1-2. 


System 

Speedup  factor 

comments 

1 

2.44 

utilization  0.2S 

2.20 

utilization  0.S 

2.44 

utilization  0.5  +  rare  events 

2.18 

utilization  0.75 

2 

3.64 

Erlang  order  2 

7,73 

Erlang  order  5 

16.48 

Erlang  order  10 

27.00 

Erlang  order  20 

follows,  we  shall  focus  on  studying  the  speedup  factors,  as  defined  in  (17),  obtained  for  a 
variety  of  DES. 


6. 1.  Speedup  Factor  for  Several  Systems 

We  briefly  describe  below  each  specific  DES  considered  with  the  speedup  factors  obtained 
in  a  particular  computer  environment  (486  PC)  used  in  this  study. 


System  1:  M/M/1  /K  with  Multiple  Classes  of  Customers 

This  is  a  single-server  queueing  system  serving  various  classes  of  customers  on  a  First-In- 
First-Out  (FIFO)  basis.  Systems  with  2  to  1 1  classes  were  implemented  and  each  class  has 
exponential  service  and  interarrival  times.  A  performance  analysis  problem  which  is  often 
of  interest  in  such  systems  is  estimating  the  blocking  probability  of  each  customer  class  as 
a  function  of  the  buffer  size  K.  Several  values  of  arrival  and  service  rates  were  used  so  as  to 
achieve  different  server  utilizations,  as  shown  in  Table  1.  In  addition,  experiments  included 
a  system  where  one  of  the  classes  had  a  very  low  arrival  rate  ( 1 00  times  slower)  in  order  to 
investigate  the  behavior  of  TWA  when  a  constructed  sample  path  may  be  suspended  for  a 
long  time  before  it  gets  a  lifetime  that  is  missing. 


System  2:  G/G/l /K  with  Multiple  Classes  of  Customers 

This  is  the  same  as  System  1  with  two  classes  of  customers,  but  the  service  and  interarrival 
times  are  now  obtained  from  an  Erlang  distribution  of  the  same  order  so  that  the  server 
utilization  is  0.67.  As  seen  in  Table  1,  the  speedup  factor  increases  with  the  order  of 
the  Erlang  distribution;  this  is  expected,  since  Erlang  random  variate  generation  is  more 
complex  than  the  exponential  one,  which  in  tum  implies  that  a  is  increased. 
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Table  2.  Speedup  factors  for  System  3. 


*1 

Mu 

M21 

*2 

Mi2 

M22 

Speedup  factor 

1.0 

4.0 

4.0 

1.0 

4.0 

2.63 

1.0 

3.0 

3.0 

1.0 

3.0 

4.0 

2.58 

1.0 

4.0 

4.0 

1.0 

4.0 

2.63 

1.0 

3.0 

3.0 

1.0 

3.0 

2.25 

Table  3.  Speedup  factors  for  System  4. 


System 

*1 

Mil 

M2I 

M3I 

*2 

MI2 

M22 

M32 

Speedup  factor 

manufacturing 

1.0 

5.0 

7.0 

12.0 

1.0 

5.0 

7.0 

12.0 

2.70 

blocking 

1.0 

3.0 

3.0 

3.0 

1.0 

5.0 

7.0 

9.0 

2.67 

1.0 

2.0 

6.0 

3.0 

2.0 

5.0 

4.0 

3.0 

2.62 

3.0 

20.0 

4.0 

4.0 

3.0 

20.0 

4.0 

4.0 

2.41 

communication 

1.0 

5.0 

10.0 

15.0 

1.0 

5.0 

10.0 

15.0 

2.66 

blocking 

1.0 

3.0 

3.0 

3.0 

1.0 

5.0 

7.0 

9.0 

2.49 

1.0 

2.0 

5.0 

3.0 

3.0 

5.0 

7.0 

5.0 

2.52 

2.0 

3.0 

10.0 

4.0 

5.0 

7.0 

10.0 

9.0 

2.50 

System  3:  Two  Queues  in  Series 


This  is  a  serial  network  of  two  M/M/l/K  queues  with  two  classes  of  customers,  that 
operates  under  “manufacturing  blocking”  i.e.,  customers  finding  queue  2  full  wait  in  server 
1  for  the  next  available  queueing  slot.  Both  servers  operate  under  a  FIFO  scheduling  policy. 
The  arrival  process  is  Poisson  and  the  two  servers  are  exponential.  The  performance  measure 
of  this  system  is  the  throughput  as  a  function  of  the  buffer  allocation,  i.e.  ( ATi ,  K2)  subject 
to  +  Ki  =  K  where  K[,  K2  are  the  number  of  buffers  allocated  to  each  server.  Table  2 
shows  some  typical  speedup  results  obtained  for  different  parameter  settings.  The  arrival 
rate  of  class  i  is  A.,-  and  its  service  rate  at  the  jth  server  is  /tty  (t ,  j  =  1 , 2). 


System  4:  Three  Queues  in  Series 


This  is  the  same  as  System  3  except  we  extend  it  to  three  queues.  Two  cases  are  considered: 
“manufacturing”  blocking  (in  which  case  we  have  cr  =  2  in  (27)  and  the  system  is  not 
regular)  and  “communication”  blocking,  (in  which  case  a  =  1  and  the  system  is  regular). 
Typical  results  are  shown  in  Table  3.  One  observation  is  that  the  lack  of  regularity  had 
minimal  effect  on  the  speedup  factor  in  this  case.  Another  is  that,  comparing  Tables  2  and 
3,  the  increase  in  size  of  the  system  from  two  queues  to  three  had  minimal  effect  on  the 
speedup  factors  observed. 
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Table  4.  Speedup  factors  for  System  5. 


*1 

Ml 

*2 

M2 

5W2-„ 

Speedup  factor 

1.0 

2.0 

1.0 

3.0 

0.2 

0.4 

2.51 

1.0 

2.5 

1.0 

5.0 

1.0 

0.5 

1.91 

1.0 

2.0 

1.0 

2.0 

0.5 

0.0 

1.74 

1.0 

2.0 

1.0 

2.0 

0.0 

0.0 

2.11 

System  5:  Two  Queues  in  Parallel  with  a  Single  Bulk-Service  Server 

A  single  server  is  servicing  two  classes  of  customers  using  a  Round  Robin  scheduling 
policy.  Both  customer  arrival  processes  are  Poisson  and  each  class  is  serviced  in  batches 
with  exponential  service  times.  Furthermore,  there  is  a  deterministic  time  delay  every 
time  the  server  switches  from  one  class  to  the  other,  denoted  by  SWi_>2  and  S  W2-,  1 .  The 
performance  measure  is  the  average  system  delay  as  a  function  of  the  batch  size  of  each 
class  of  customers.  Some  representative  results  are  shown  in  Table  4. 

As  already  mentioned,  the  main  cost  involved  in  using  TWA  is  the  storage  requirement  for 
event  lifetimes  in  V,  (n,  k)  for  all  i  e  £.  In  fact,  it  is  not  possible  to  bound  these  sequences 
in  our  approach.  In  practice,  this  implies  that  some  stored  event  lifetimes  may  eventually 
have  to  be  discarded  to  avoid  memory  overflows. 

The  performance  of  TWA  clearly  depends  on  the  DES  considered.  The  results  in  Table 
1  that  correspond  to  system  2  show  the  correlation  of  the  complexity  of  the  random  vari¬ 
ate  generation  processes  involved  with  the  performance  of  TWA.  Erlang  random  variates 
are  complex  and  require  CPU  intensive  operations  to  obtain,  thus  the  percentage  of  time 
spent  in  generating  random  variates  (i.e.,  a  as  defined  in  section  4.2)  increases,  which  in 
turn  increases  the  speedup  obtained  and  therefore  makes  TWA  a  more  attractive  approach. 
Conversely,  this  approach  is  not  recommended  for  systems  with  completely  deterministic 
event  processes. 

Lastly,  it  is  our  observation  that  for  the  systems  studied  performance  was  quite  sensitive 
to  specific  implementations  and  processing  architectures.  This  dependence  is  captured  by 
the  ratio  0  defined  in  section  4.2  and  it  is  shown  in  Figure  2  for  a  >  0.5.  Thus,  the  speedup 
factors  presented  in  the  tables  above  may  be  largely  dependent  on  the  computer  environment 
and  specific  implementation  of  TWA  one  adopts. 


6.2.  Statistical  Significance  of  Estimates  obtained  through  TWA 

As  indicated  earlier,  the  constructed  sample  path  may  remain  suspended  for  extended  periods 
of  time  while  waiting  for  one  or  more  of  the  missing  events.  This  in  turn,  implies  that  while 
the  length  of  the  observed  sample  path  {N)  is  long  enough  to  guarantee  that  the  observed 
measures  are  statistically  significant,  the  length  of  the  constructed  sample  path  (K)  may  not 
become  long  enough  to  provide  such  accuracy.  Figure  3  shows  the  ratio  (K/N)  for  System 
1  in  section  6.1  when  there  are  four  classes  of  customers  and  the  observed  sample  path  has 
five  buffer  slots.  First  note  that  when  all  events  occur  with  similar  frequency,  the  K/ N  ratio 
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Nominal  Events  (N)(xl000) 


Figure  3.  K/N  ratio  for  system  1. 


converges  quickly  ( M/M/X/l  curve),  whereas,  when  there  is  a  rare  event  (class  4  arrival) 
often  the  constructed  sample  path  is  forced  to  wait  for  long  intervals  until  the  rare  event  is 
observed  which  causes  the  K/N  ratio  to  become  small.  Once  the  missing  event  is  observed, 
a  large  number  of  events  may  be  immediately  processed  allowing  K /N  to  increase  again. 
This  results  in  the  initial  large  oscillations  observed  in  Figure  3  which  eventually  diminish 
as  N  grows  larger. 

In  addition,  note  that  the  parameter  settings  also  affect  the  K/N  ratio.  In  the  case  of  a 
single  buffer  slot  (Af/Af/1/1),  the  blocking  probability  is  much  larger  than  in  the  observed 
sample  path,  therefore,  several  observed  departure  events  are  not  constructed  because  the 
corresponding  customer  was  lost.  For  this  reason,  K/N  converges  to  a  value  less  than  one 
(in  this  case  0.85).  On  the  other  hand,  when  the  constructed  sample  path  has  nine  slots,  the 
observed  and  constructed  sample  paths  have  comparable  blocking  probabilities  therefore 
most  of  the  observed  events  are  also  constructed,  so  the  K/N  ratio  is  closer  to  one. 


6.3.  Speedup  Comparisons 

As  mentioned  earlier,  ASA  and  SC  are  two  methods  that  have  been  developed  for  constructing 
multiple  sample  paths  at  different  parameter  settings.  In  a  simulation  environment,  both 
techniques  can  achieve  higher  speedup  than  the  proposed  TWA,  as  indicated  in  Figure  4  for 
anM/M/l/K  system  studied  over  a  range  of  values  for  K .  ASA  can  achieve  a  speedup  of  up 
to  30,  considerably  higher  than  both  SC  and  7WA,  however,  it  is  only  applicable  to  systems 
with  exponential  event  lifetime  distributions  (with  the  possibility  of  one  non-Markovian 
event  process,  as  noted  in  section  1).  SC  can  achieve  a  speedup  of  up  to  8,  however  it  is 
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Figure  4.  Speedup  of  ASA,  SC  and  TWA,  for  an  M/M/l/K  system. 


also  limited  to  exponential  event  lifetime  distributions  (unless  approximations  are  used  for 
systems  with  general  distributions,  as  in  Chen  and  Ho,  1995).  In  Figure  4,  the  speedup  for 
the  TWA  turns  out  to  be  in  the  vicinity  of  2. 

Finally,  note  that  from  the  definition  of  speedup  (17),  one  would  expect  that  it  is  not  a 
function  of  the  number  of  concurrently  constructed  sample  paths.  Intuitively,  suppose  that 
one  is  interested  in  concurrently  constructing  M  sample  paths.  Therefore,  using  brute  force 
it  would  take  M  TN  time  units  to  generate  M  N  events,  while  using  any  other  constructability 
technique  it  would  require  MxK  time  units  to  construct  MK  events.  Hence,  the  speedup 
factor  should  be  independent  of  M.  However,  as  described  previously,  the  number  of 
constructed  events  depends  on  the  parameter  settings  as  well.  For  this  reason,  the  number 
of  constructed  events  is  given  by  MK  —  Kq  (not  MK).  Therefore, 

MTn/MN  Tn  (K  Ko  \ 

S  Mtk/(MK  -Kq)~  tk\N  Nm) 

which  implies  that  it  approaches  a  constant  as  M  becomes  larger. 


7.  Conclusions  and  Future  Work 

The  sample  path  coupling  approach  we  have  presented  is  intended  to  solve  the  constructabil¬ 
ity  problem  described  in  section  2  without  imposing  any  restrictions  on  the  event  processes 
in  the  DES  as  in  earlier  work.  The  approach  leads  to  the  specific  “time  warping  algorithm 
detailed  in  section  3.3,  which  was  analyzed  in  terms  of  computational  cost. 

As  pointed  out  in  the  Introduction,  we  emphasize  once  again  that  the  proposed  approach 
to  the  problem  of  constructability  is  suited  to  on-line  sample  path  construction  as  a  concur¬ 
rent  estimation  scheme.  In  addition,  it  may  be  used  as  an  off-line  concurrent  simulation 
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approach.  In  the  former  case,  computational  cost  is  not  an  issue,  as  the  procedure  we  use 
involves  only  simple  operations;  rather,  the  storage  requirement  for  observed  event  lifetimes 
becomes  the  limiting  factor. 

There  remain  a  number  of  issues  that  require  investigation  in  order  to  better  assess  the 
value  of  the  concurrent  sample  path  construction  method  we  have  presented.  First,  a  detailed 
analysis  of  the  extensions  allowing  us  to  relax  Assumptions  ( A1)-(A3 )  as  briefly  outlined 
in  section  5  is  the  subject  of  ongoing  work.  Second,  the  proposed  sample  path  coupling 
approach  is  not  limited  to  any  specific  performance  measure  of  interest.  However,  one  can 
expect  that,  depending  on  the  nature  of  a  performance  measure  to  be  estimated  it  should  be 
possible  to  utilize  only  part  of  a  constructed  sample  path,  hence  increasing  the  efficiency 
of  the  TWA  in  terms  of  speedup. 

Finally,  note  that  the  proposed  estimation  technique,  when  used  together  with  any  op¬ 
timization  algorithm  based  on  comparisons  of  the  system’s  performance  under  different 
parameters,  immediately  improves  the  convergence  rate  of  the  algorithm  because  concur¬ 
rent  estimation/simulation  inherently  uses  the  common  random  numbers  (CRN)  scheme 
which  has  been  observed  experimentally  and  was  proved  theoretically  in  some  cases  to  be 
effective  in  variance  reduction  (Dai  and  Chen,  1997). 
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Notes 

1 .  For  readers  familiar  with  the  structure  of  a  standard  discrete  event  simulator,  this  idea  applied  to  such  a 
simulator  implies  the  following:  when  an  event  in  the  "event  calendar”  is  processed  it  is  normally  discarded; 
what  is  proposed  here  is  that  it  should  not  be  discarded  but  saved  and  used  by  concurrently  generated  sample 
paths  in  their  respective  event  calendars. 
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Abstract 

We  consider  stochastic  discrete  optimization  problems 
where  the  decision  variables  are  non-negative  integers. 
We  propose  and  analyze  an  on-line  control  scheme 
which  transforms  the  problem  into  a  “surrogate”  con¬ 
tinuous  optimization  problem  and  proceeds  to  solve  the 
latter  using  standard  gradient-based  approaches  while 
simultaneously  updating  both  actual  and  surrogate  sys¬ 
tem  states.  Convergence  of  the  proposed  algorithm 
is  established  and  it  is  shown  that  the  discrete  state 
neighborhood  of  the  optimal  surrogate  state  contains 
the  optimal  solution  of  the  original  problem.  Numer¬ 
ical  results  are  included  in  the  paper  illustrating  the 
fast  convergence  properties  of  this  approach. 


1  Introduction 

We  consider  stochastic  discrete  optimization  problems 
where  the  decision  variables  are  non-negative  integers. 
In  the  context  of  resource  allocation,  for  example,  clas¬ 
sic  problems  of  this  type  include  buffer  allocation  in 
queueing  models  of  manufacturing  systems  or  commu¬ 
nication  networks  and  transmission  scheduling  in  radio 
networks.  In  the  context  of  Discrete  Event  Systems 
(DES),  integer- valued  control  variables  have  proven  to 
be  very  common  (e.g.,  as  threshold  parameters  in  many 
control  policies),  making  the  issue  of  optimizing  over 
such  variables  of  particular  interest. 

Let  r  £  Z+  be  the  decision  vector  or  “state”.  In 
general,  there  is  a  set  of  feasible  states  denoted  by 
Ad  such  that  r  €  Ad  represents  a  constraint.  For 
example,  in  a  typical  resource  allocation  problem,  rj 
denotes  the  number  of  resources  that  user  i  is  as¬ 
signed  subject  to  a  capacity  constraint  of  the  form 
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Ad  =  {r  :  Y^iLi  ri  —  K}-  In  a  stochastic  setting, 
let  Ld{r,u)  be  the  cost  incurred  over  a  specific  sample 
path  (denoted  by  u)  and  Jd(r)  be  the  expected  cost  of  a 
system  operating  under  r.  Then,  the  discrete  optimiza¬ 
tion  problem  we  are  interested  in  is  the  determination 
of  r*  €  Ad  such  that 

Jd(r*)  =min  Jd(r)  =  min  Eu[Ld(r,u>)  1  (1) 

r€Ad  r€Ad 

In  general,  this  is  a  notoriously  hard  stochastic  integer 
programming  problem.  Even  in  a  deterministic  set¬ 
ting,  where  we  may  set  Jd(r )  =  Ld{r,u),  this  class 
of  problems  is  NP-hard  (see  [14]  [12]  and  references 
therein).  In  some  cases,  depending  upon  the  form  of 
the  objective  function  Jd(r )  (e.g.,  separability,  convex¬ 
ity),  efficient  algorithms  based  on  finite-stage  dynamic 
programming  or  generalized  Lagrange  relaxation  meth¬ 
ods  are  known.  Alternatively,  if  no  a  priori  information 
is  known  about  the  structure  of  the  problem,  then  some 
form  of  a  search  algorithm  is  employed  (e.g.,  Simulated 
Annealing  [1],  Genetic  Algorithms  [11]).  When  the  sys¬ 
tem  operates  in  a  stochastic  environment  (e.g.,  in  a 
resource  allocation  setting  users  request  resources  at 
random  time  instants  or  hold  a  particular  resource  for 
a  random  period  of  time)  and  no  closed-form  expres¬ 
sion  for  Eu[Ld(r,u))]  is  available,  the  problem  is  fur¬ 
ther  complicated  by  the  need  to  estimate  Eu[Ld(r,  w)]. 
This  generally  requires  Monte  Carlo  simulation  or  di¬ 
rect  measurements  made  on  the  actual  system. 

While  the  area  of  stochastic  optimization  over  con¬ 
tinuous  decision  spaces  is  rich  and  usually  involves 
gradient-based  techniques  as  in  several  well-known 
stochastic  approximation  algorithms  [13], [15],  the  lit¬ 
erature  in  the  area  of  discrete  stochastic  optimization 
is  relatively  limited.  Most  known  approaches  are  based 
on  some  form  of  random  search  (e.g.,  [16], [8])  or,  more 
recently,  the  use  of  the  ordinal  optimization  approach 
presented  in  [9].  For  a  class  of  resource  allocation  prob¬ 
lems  of  the  form  (1),  an  approach  of  this  type  was  used 
in  [3].  Even  though  the  approach  in  [3]  yields  a  fast  re¬ 
source  allocation  algorithm,  it  is  still  constrained  to  it¬ 
erate  so  that  every  step  involves  the  transfer  of  no  more 
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than  a  single  resource  from  one  user  to  some  other  user. 
One  can  expect,  however,  that  much  faster  improve¬ 
ments  can  be  realized  in  a  scheme  allowed  to  reallocate 
multiple  resources  from  users  whose  cost-sensitivities 
are  small  to  users  whose  sensitivities  are  much  larger. 
With  this  motivation  in  mind,  it  is  reasonable  to  pose 
the  following  question:  Is  it  possible  to  transform  a  dis¬ 
crete  optimization  problem  as  in  (1)  into  a  “surrogate” 
continuous  optimization  problem,  proceed  to  solve  the 
latter  using  standard  gradient-based  approaches,  and 
finally  transform  its  solution  into  a  solution  of  the  orig¬ 
inal  problem?  Moreover,  is  it  possible  to  design  this 
process  for  on-line  operation?  That  is,  at  every  it¬ 
eration  step  in  the  solution  of  the  surrogate  continu¬ 
ous  optimization  problem,  is  it  possible  to  immediately 
transform  the  surrogate  continuous  state  into  a  feasible 
discrete  state  r?  This  is  crucial,  since  whatever  infor¬ 
mation  is  used  to  drive  the  process  (e.g.,  sensitivity 
estimates)  can  only  be  obtained  from  a  sample  path  of 
the  actual  system  operating  under  r. 

In  this  paper,  we  transform  the  original  discrete  set  Ad 
into  a  continuous  set  over  which  a  “surrogate”  opti¬ 
mization  problem  is  defined  and  subsequently  solved. 
As  in  earlier  work  in  [3],  [4]  and  unlike  algorithms  pre¬ 
sented  in  [12],  an  important  feature  of  our  approach  is 
that  every  state  r  in  the  optimization  process  remains 
feasible,  so  that  our  scheme  can  be  used  on  line  to  ad¬ 
just  the  decision  vector  as  operating  conditions  (e.g., 
system  parameters)  change  over  time.  Thus,  at  every 
step  of  the  continuous  optimization  process,  the  con¬ 
tinuous  state  obtained  is  mapped  back  into  a  feasible 
discrete  state;  based  on  a  realization  under  this  feasible 
state,  new  sensitivity  estimates  are  obtained  that  drive 
the  surrogate  problem  to  yield  the  next  continuous 
state.  The  proposed  scheme,  therefore,  involves  an  in¬ 
terplay  of  sensitivity-driven  iterations  and  continuous- 
to-discrete  state  transformations.  The  key  issue  then  is 
to  show  that  when  (and  if)  an  optimal  allocation  is  ob¬ 
tained  in  the  continuous  state  space,  the  transformed 
discrete  state  is  in  fact  r*  in  (1). 


2  Basic  approach 

In  the  sequel,  we  shall  adopt  the  following  notational 
conventions.  We  shall  use  subscripts  to  indicate  com¬ 
ponents  of  a  vector  (e.g.,  r»  is  the  ith  component  of 
r).  We  shall  use  superscripts  to  index  vectors  belong¬ 
ing  to  a  particular  set  (e.g.,  r3  is  the  jth  vector  of  the 
same  form  as  r  within  a  subset  of  Ad  that  contains  such 
vectors).  Finally,  we  reserve  the  index  n  as  a  subscript 
that  denotes  iteration  steps  and  not  vector  components 
(e.g.,  r„  is  the  value  of  r  at  the  nth  step  of  an  iterative 
scheme,  not  the  nth  component  of  r). 


One  common  method  for  the  solution  of  this  problem  is 
to  relax  the  integer  constraint  on  all  rj  so  that  they  can 
be  regarded  as  continuous  (real-valued)  variables  and 
then  apply  standard  optimization  techniques  such  as 
gradient-based  algorithms.  The  resulting  “surrogate” 
problem  then  is  to  find  p*  £  Ac  so  that 

Jc{p*)  =  min  Jc(p)  =  min  Eu[Lc(p,u>)}  (2) 

p€Ac  peAc 

where  p  £  M+  is  a  real-valued  state,  Ac  is  the  con¬ 
vex  hull  of  the  original  constraint  set  Ad,  and  Lc(p,w) 
is  the  cost  function  over  a  specific  sample  path  (de¬ 
noted  again  by  u>)  when  the  state  is  p.  Assuming  an 
optimal  solution  p*  can  be  determined,  this  state  must 
then  be  mapped  back  into  a  discrete  vector  by  some 
means  (usually,  some  form  of  truncation).  Even  if  the 
final  outcome  of  this  process  can  recover  the  actual 
r*  in  (1),  this  approach  is  strictly  limited  to  off-line 
analysis:  When  an  iterative  scheme  is  used  to  solve 
the  problem  in  (2)  (as  is  usually  the  case  except  for 
very  simple  problems  of  limited  interest),  a  sequence 
of  points  { pn }  is  generated;  these  points  are  generally 
continuous  states  in  Ac,  hence  they  may  be  infeasible 
in  the  original  discrete  optimization  problem.  More- 
over,  if  one  has  to  estimate  Eu[Lc\p)W)\  or  3^7  a 
through  simulation,  then  a  simulated  model  of  the  sur¬ 
rogate  problem  must  be  created,  which  is  also  not  gen¬ 
erally  feasible.  If,  on  the  other  hand,  the  only  cost 
information  available  is  through  direct  observation  of 
sample  paths  of  an  actual  system,  then  there  is  no  obvi¬ 
ous  way  to  estimate  Eu[Lc(p,u>)\  or  9Ew<[  ffpP'  ^  >  since 
this  applies  to  the  real-valued  p,  not  the  actual  cost 
observable  under  integer- valued  r. 

In  this  paper  we  propose  a  different  approach  intended 
to  operate  on  line.  In  particular,  we  still  invoke  a  re¬ 
laxation  such  as  the  one  above,  i.e.,  we  formulate  a 
surrogate  continuous  optimization  problem  with  some 
state  space  Ac  C  K+  and  Ad  C  Ac.  However,  at  ev¬ 
ery  step  n  of  the  iteration  scheme  involved  in  solving 
the  problem,  the  discrete  state  is  updated  through  a 
mapping  of  the  form  rn  =  fn{pn)  38  Pn  ls  updated 
using  a  stochastic  approximation  algorithm.  This  has 
two  advantages:  First,  the  cost  of  the  original  system 
is  continuously  adjusted  (in  contrast  to  an  adjustment 
that  would  only  be  possible  at  the  end  of  the  surro¬ 
gate  minimization  process);  and  second,  it  allows  us  to 
make  use  of  information  typically  used  to  obtain  cost 
sensitivities  from  the  actual  operating  system  at  every 
step  of  the  process.  It  is  important  to  note  that  {r„} 
corresponds  to  feasible  realizable  states  based  on  which 
one  can  evaluate  sensitivity  estimates  from  observable 
data,  i.e.,  a  sample  path  of  the  actual  system  under  r„ 
(not  the  surrogate  state  pn).  We  can  therefore  see  that 
this  scheme  is  intended  to  combine  the  advantages  of 
a  stochastic  approximation  type  of  algorithm  with  the 
ability  to  obtain  sensitivity  estimates  with  respect  to 
discrete  decision  variables.  In  particular,  sensitivity 
103 


estimation  methods  for  discrete  parameters  based  on 
Perturbation  Analysis  (PA)  and  Concurrent  Estima¬ 
tion  [10], [2]  are  ideally  suited  to  meet  this  objective. 

3  Continuous-to-discrete  state 

transformations 

In  the  sequel  we  will  assume  that  Ac  D  hN  =  Ad  where 
Ac  is  the  convex  hull  of  Ad]  that  is,  all  the  discrete 
allocations  contained  in  Ac  are  feasible.  Given  a  vector 
p  e  r£,  we  begin  by  specifying  a  set  Fp  of  mappings 
f(p).  To  do  so,  first  define 

Ip  —  {*  I  Pi  e  z+)  (3) 

to  be  the  set  of  components  of  p  (i.e.,  user  indices)  that 
are  strictly  integer.  Next,  define 

</,+M./fM>  =  {  (4) 

where,  for  any  x  €  R,  far]  and  [xj  denote  the  ceiling 
(smallest  integer  >  x)  and  floor  (largest  integer  <  x) 
of  x  respectively.  Then,  let 

FP  =  {f  \f  :  Ac ->  Ad,\i  fi(p)  €  {f+{p)t  /f  (p)» 

f  €  Fp  transforms  continuous  state  vector  p  €  Ac  into 
a  “neighboring”  discrete  state  vector  r  €  Ad  obtained 
by  seeking  [p/|  or  [pj  for  each  component.  Note  that 
for  all  /  £  Fp  and  r  €  Ad,  f(r )  =  r. 

Definition  1  The  set  of  all  feasible  discrete  neighbor¬ 
ing  states  of  p  €  Ac  is: 

A f{p)  =  {r  |  r  =  f{p)  for  some  f  e  Fp]  (5) 

Although  much  of  the  ensuing  analysis  applies  to  any 
constraint  set  Ac ,  we  shall  limit  ourselves  to  the  com¬ 
mon  case  of  a  total  resource  capacity  constraint: 

Ac  =  jp  :  =  JT  j  (6) 

In  this  case,  a  more  explicit  and  convenient  charac¬ 
terization  of  the  set  A f{p)  is  possible  by  defining  the 
residual  vector  p  £  [0,  l)w  of  the  continuous  state  p, 
given  by  p  =  p  -  [pj  where  |_pj  is  the  vector  whose 
components  are  [pjj  =  |_pj  •  Then,  set 

™p  =  ~  LPJ)  LftJ  (7) 

»= l  »=i  t=i 

and  note  that  mp  £  Z+  is  an  integer  with  the  fol¬ 
lowing  convenient  interpretation:  If  all  users  are  as¬ 
signed  LpJ  resources,  then  mp  is  the  total  residual  re¬ 
source  capacity  to  be  allocated.  Recalling  the  defini¬ 
tion  of  the  set  Ip  in  (3),  let  q  =  \IP\  be  the  number 


of  components  of  p  with  strictly  integer  values.  Then, 
mpe{0,...,N-q-l}. 

Fp  may  be  interpreted  as  a  set  of  mappings  that  allo¬ 
cate  mp  residual  resources  over  all  i  $  Ip  in  addition  to 
a  fixed  integer  [pj  already  assigned  to  i.  Let  us  then 
define  r^(p)  £  {0,1} N  to  be  the  }th  residual  discrete 
vector  corresponding  to  some  given  p  which  satisfies 
J2iLi  H  =  mP  rj  =  0  for  i  £  Ip.  Thus,  rJ  (p)  is 
an  //-dimensional  vector  with  components  0  or  1  sum¬ 
ming  up  to  77ip.  It  is  easy  to  see  that  the  number  of 
such  distinct  vectors,  and  hence  the  cardinality  of  the 
set  A f(p),  is  (Nnfg).  ^  follows  that  we  can  write,  for  all 
fi  £  p  , 

P{P)  =  [pj  +  P(P)  (8) 

The  following  theorem  establishes  the  fact  that  any  p  € 
Ac  can  be  expressed  as  a  convex  combination  of  points 
r  G  A f(p).  All  proofs  are  omitted  but  may  be  found  in 

Pi- 

Theorem  3.1  Any  p  €  Ac  is  a  convex  combination  of 
its  discrete  feasible  neighboring  states,  i.e.,  there  exists 
a  vector  a  such  that 

M  M 

p=]T  ajrj ,  with  ctj  =  1,  aj  >  0  Vj  =  1, ..,  M 
i= i  i=i 

where  M  =  lA/Xp)!  and  r *  6  AA(p),  j  =  1, . . . ,  M. 

This  result  asserts  that  every  p  €  Ac  belongs  to 
conv(M(p)),  the  convex  hull  of  the  feasible  neighboring 
state  set  M[p)  defined  in  (5).  We  can  further  limit  the 
set  of  states  over  which  such  a  convex  combination  can 
be  formed  as  follows. 

Corollary  3.1  Any  p  £  Ac  is  a  convex  combination  of 
at  most  N  —  q  discrete  feasible  neighboring  states,  i.e., 
there  exists  a  vector  a  such  that  for  all  j  =  1, ..,  N 

N—q  N—q 

p  =  ^  otjri  with  aj  =  1,  aj  >  0  (9) 

j=i  j= l 

where  rj  £  A f{p),  j  =  1, . . . ,  |A/(p)|,  q  =  \IP\. 

Definition  2  A/jv-?(p)  is  a  subset  of  Af(p)  that  con¬ 
tains  N-q  (with  q  =  \Ip\)  linearly  independent  discrete 
neighboring  states  whose  convex  hull  includes  p. 

The  existence  of  this  set  is  guaranteed  by  the  previ¬ 
ous  corollary  and  it  plays  a  crucial  role  in  our  ap¬ 
proach,  because  the  mapping  /„(pn)  will  be  an  element 
of  A/jv-?(pn)-  Therefore,  it  is  important  to  be  able  to 
identify  N-q  elements  of  A f{p)  that  satisfy  (9),  and 
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hence  determine  M^-q(pn)-  The  Simplex  Method  of 
Linear  Programming  (LP)  is  suitable  for  this  task. 

Given  this  “reduced”  set  of  discrete  feasible  neigh¬ 
bors  of  p ,  A rN-q(p),  we  restrict  our  original  set  of 
continuous-to-discrete  transformations  Fp  to 

*■,  =  {/:  M  €  tfN-q(p)}  (10) 

Note  that  when  the  continuous  state  is  pn,  the 
continuous-to-discrete  mapping  /„  will  be  an  element 

°f  ?Pn- 


4  Construction  of  surrogate  cost 
functions  and  their  gradients 


In  the  sequel,  ‘w’  will  be  dropped  from  Ld(r,u)  and 
from  the  “surrogate”  cost  function  Lc(p,uj)  in  (2). 
Moreover,  unless  otherwise  noted,  all  costs  will  be  over 
the  same  sample  path.  Since  our  approach  is  based  on 
iterating  over  the  continuous  state  p  G  Ac,  yet  drive  the 
iteration  process  with  information  involving  Ld{r)  ob¬ 
tained  from  a  sample  path  under  r,  we  must  establish 
a  relationship  between  Ld(r )  and  Lc(p).  The  choice  of 
Lc(p)  is  rather  flexible  and  may  depend  on  information 
pertaining  to  a  specific  model  and  the  nature  of  the 
given  cost  Ld(r). 

As  seen  in  the  previous  section,  it  is  possible  that  some 
components  of  p  are  integers,  in  which  case  the  set  Ip 
is  non-empty  and  we  have  q  =  \IP\  >0.  In  order  to 
avoid  the  technical  complications  due  to  such  integer 
components,  let  us  agree  that  whenever  this  is  the  case 
we  will  perturb  these  components  to  obtain  a  new  state 
p  such  that  Ip  =  0.  In  what  follows,  we  will  assume 
that  all  states  p  either  have  Ip  =  0  or  have  already 
been  perturbed  and  relabeled  p.  Since  q  is  going  to  be 
zero,  we  will  rename  A f^-q(p)  as  Mn (p)  . 

We  shall  select  a  surrogate  cost  function  Lc{p)  to  satisfy 
the  following  two  conditions: 

(Cl):  Consistency.  Lc{r )  —  Ld(r)  for  all  r  G  Ad- 

(C2):  Piecewise  Linearity:  Lc(p)  is  a  linear  function 
of  p  over  cotiv(Mn(p)). 


Consistency  is  an  obvious  requirement  for  Lc(p).  Piece- 
wise  linearity  is  chosen  for  convenience,  since  manip¬ 
ulating  linear  functions  over  cotiv{Mn{,p))  simplifies 
analysis,  as  will  become  clear  in  the  sequel.  Given 
some  state  p  G  Ac  and  cost  functions  Ld(rj)  for  all 
rj  G  MN(p),  it  follows  from  (C2)  and  (9)  in  Corollary 
3.1  that  we  can  write 


N 

Lc{p )  =  ^Oj-Mr7) 


j= i 


with  aj  =  1,  aj  >  0  for  all  j  =  1, N.  Moreover, 
by  (Cl),  we  have 

N 

Lc{p)  =  Y,aMrj)  (12) 

3= 1 

that  is,  Lc(p)  is  a  convex  combination  of  the  costs  of 
N  discrete  feasible  neighbors.  Next,  in  order  to  use  a 
stochastic  approximation  algorithm,  we  need  sensitiv¬ 
ity  information  provided  through  the  sample  gradient 
VLc(p)  expressed  in  terms  of  directly  observable  sam¬ 
ple  path  data. 

Since  Lc(p)  is  a  linear  function  on  the  convex  hull  de¬ 
fined  by  the  N  discrete  neighbors  in  (12),  one  can  write 

N 

Lc(p)  =  Y^PiPi  +  P  o  (13) 

i=l 

for  some  G  R,  i  =  0,  Moreover,  due  to  the 

linearity  of  Lc(p)  in  conv{Mt<{p )),  we  have 


For  any  discrete  feasible  neighboring  state  G  Nn(p), 
one  can  use  (13)  and  (Cl)  to  write 


N 

=  +  3  =  1 . N  (15) 

i=l 


Letting  VLc(p)'  =  [01,...,PN]  be  the  gradient  of 
Lc{p),  our  objective  is  to  obtain  an  expression  for 
Pi,..., (not  P0)  in  terms  of  Ld(rJ).  Note  that 
Ld{ri)  are  costs  that  can  be  evaluated  at  feasible  states 
r7  G  MN(p).  These  may  be  obtained  by  direct  simula¬ 
tion;  however,  they  are  not  available  if  a  system  is  op¬ 
erating  on  line  under  one  of  these  states,  say  r1 .  This 
is  where  techniques  such  as  Concurrent  Estimation  and 
Perturbation  Analysis  mentioned  earlier  can  be  used  to 
facilitate  this  task. 


To  obtain  expressions  for  Px , . . . ,  PN  in  terms  of  Ld{rj), 
let  r1  be  the  current  state  of  the  system  (without  loss 
of  generality),  and  define 


f  -1  if  r-  >  r{ 

^=rj-rj=  1  if  r]  <  r\  (16) 

(  0  otherwise 

and 

ALjA  =  Ld(rj)  -  Ldir1)  (17) 

for  all  i  =  1, . . . , N  and  j  =  2, . . . ,  N.  Using  (15),  the 
last  equation  can  be  rewritten  as 


(11) 
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5  Optimization  Algorithm 


If  all  L<t{ri)  in  (17)  are  observable,  then  (18)  provides 
TV  —  1  linearly  independent  equations  for  the  TV  vari¬ 
ables  Plt...,PN.  An  additional  equation  is  obtained 
as  follows:  In  the  stochastic  approximation  algorithm 

Pn+ 1  =  7Tn+l  [pn  -  T)n VLc(pn)}  (19) 

let  pn+1  =  pn  -  7?nV Lc(pn)  be  an  “intermediate”  state 
prior  to  applying  the  projection  ^n+\-  In  order  to  force 
Pn+\  to  satisfy  the  total  capacity  constraint  (6),  i.e. , 


N  N  N 


dLciPn) 

Pi 


=  K  (20) 


we  require  that 


Summarizing  the  results  of  the  previous  sections  and 
combining  them  with  the  stochastic  approximation  al¬ 
gorithm,  we  obtain  the  following  optimization  algo¬ 
rithm  for  the  solution  of  the  basic  problem  in  (1).  After 
initializing  p0  =  ro,  for  each  iteration  n  =  0, 1, . . ., 

1.  Perturb  pn  so  that  IPn  =0. 

2.  Determine  A/”(pn)  [using  (4)-(5)]. 

3.  Determine  A/)v(pn)  [using  the  Simplex  Method]. 


O=£ft=0  (21) 

t=l  Pi  »=1 

The  combination  of  (18)  and  (21)  provides  TV  equa¬ 
tions  used  to  determine  unique  P1,...,PN.  Specifi¬ 
cally,  define  the  (TV  -  l)-dimensional  vector  A V  — 
[AL2,i,...,ALN1]  whose  components  were  defined 
in  (17),  and  the  (TV  —  1)  x  TV  matrix  AR  = 
S2'1, ...  ,  J^’1]'  whose  rows  are  the  vectors  6*’1  = 

•  • ,  <5^1  as  defined  in  (16).  We  then  get  from 
(18)  and  (21) 

VLc(p)=  X[A0L]  (22) 

Therefore,  VLc(p),  the  sample  gradient  to  be  used  as 
an  estimate  of  V  Jc(p)  in  (2),  is  obtained  through  the 
TV  costs  Ld(rl), . . .  ,Ld(rN).  The  sample  path  at  our 
disposal  corresponds  to  one  of  the  state  vectors,  which 
we  have  taken  to  be  r1  €  A//v(p),  so  that  Ld{rl)  is  ob¬ 
servable;  the  remaining  TV  —  1  costs  therefore  need  to 
be  obtained  by  some  other  means.  One  possibility  is  to 
perform  TV  —  1  simulations,  one  for  each  of  these  states. 
This,  however,  is  not  attractive  for  an  on-line  method¬ 
ology.  Fortunately,  there  exist  several  techniques  based 
on  Perturbation  Analysis  (PA)  [10], [2]  or  Concurrent 
Estimation  [5],  which  are  ideally  suited  for  this  pur¬ 
pose;  that  is,  based  on  observations  of  a  sample  path 
under  r‘,  one  can  evaluate  Ld(rJ)  for  states  ^  r* 
with  limited  extra  effort.  The  efficiency  of  these  tech¬ 
niques  depends  on  the  nature  of  the  system  and  cost 
function.  Systems  with  separable  cost  functions,  i.e. 

N 

Ld(r)  =  J^Ld,i(n)  (23) 

»=1 

is  one  area  where  PA  techniques  prove  to  be  particu¬ 
larly  efficient.  In  such  systems,  one  can  write 
N 

Lc(p)  =  Ld{rl )  +  A LdAr1)  \ Pi  ~r\  |  (24) 

*=1 

where  A Ld,i(rl)  is  the  change  in  cost  by  adding  or  re¬ 
moving  (depending  on  the  sign  of  pi  -  rf)  a  resource 
from  the  zth  user  (see  [7]). 


4.  Select  /„  E  TPn  such  that  rn  =  /n(p„)  = 
argminr6^(pj||r-pJ. 

5.  Collect  Ld{r *)  for  all  r1  €  A /w(pn)  [using  Concur¬ 
rent  Estimation  or  Perturbation  Analysis]. 

6.  Evaluate  VLc(pn)  [using  (22)]. 

7.  Update  state:  pn+1  =  7rn+1[pn  -  r)nVLc(pn)\. 

8.  If  some  stopping  condition  is  not  satisfied,  repeat 
steps  for  n  +  1.  Else,  set  p*  =  pn+l . 

We  finally  obtain  r*  as  one  of  the  neighboring  feasible 
states  in  the  set  A/)v(p*)-  It  is  shown  in  [7]  using  tech¬ 
niques  similar  to  [6]  that  under  certain  conditions  {pn}, 
with  p0  G  Ac  (initial  condition)  arbitrary,  converges  to 
p*  with  probability  1.  The  following  result  establishes 
that  we  can  then  determine  r*  E  Ad  that  solves  the 
optimization  problem  (1). 

Theorem  5.1  Let  p *  minimize  Lc(p )  over  Ac.  Then, 
there  exists  a  discrete  feasible  neighboring  state  r *  E 
Nn(p*)  which  minimizes  Ld(r)  over  Ad  and  satisfies 
Ld{r')  =  Lc(p*) 

6  Numerical  Example 

We  illustrate  our  approach  by  means  of  a  stochastic  op¬ 
timization  application  for  a  classic  problem  in  manufac¬ 
turing  systems.  Consider  a  kanban-based  manufactur¬ 
ing  system  where  15  kanban  (resources)  are  allocated 
to  3  servers  (users)  in  series.  The  objective  is  to  find 
the  optimal  allocation  r*  that  minimizes  the  Average 
Cycle  Time  (ACT),  defined  as  the  time  between  two 
job  completions  at  the  last  server  (this  is  equivalent  to 
a  throughput  maximization  problem).  The  arrival  pro¬ 
cess  is  Poisson  with  rate  A  =  1.6.  The  service  times 
of  the  servers  are  exponentially  distributed  with  rates 
=  2.0,  /x2  =  1.6,  p3  —  3.0.  In  this  case,  we  chose 
the  step  size  to  be  constant  at  tj  =  100,  while  the  ob¬ 
servation  intervals  are  increased  in  length.  The  system 
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started  with  an  initial  allocation  ro  =  [3, 5, 7]'  and  the 
algorithm  performed  as  follows: 


#  jobs 

P' 

r' 

ACT 

0 

■ 

1 

3,5,7 

0.830316 

100 

7.44,2.69,4.87 

1 

0.743377 

300 

I 

1 

1 

0.737032 

600 

7.81,3.60,3.59 

8,4,3 

0.738288 

2100 

1 

1 

7,5,3 

0.720974 

25300 

6.97,5.64,2.39 

7,6,2 

0.724723 

32500 

6.75,5.56,2.69 

I 

0.720611 

74100 

1 

1 

1 

0.712891 

316000 

1 

6.61,5.52,2.87 

1 

7,5,3 

0.722122 

Due  to  noise,  r  oscillates  between  three  allocations, 
namely  [7, 5, 3]',  [6, 6, 3]'  and  [7, 6, 2]'.  Using  brute-force 
simulation,  it  was  determined  that  these  are  the  best 
three  allocations  with  corresponding  performance  val¬ 
ues  which  are  very  close  to  each  other. 


7  Conclusions  and  Future  Work 

A  key  contribution  of  the  proposed  surrogate  problem 
methodology  is  its  on-line  control  nature,  based  on  ac¬ 
tual  data  from  the  underlying  system.  One  can  there¬ 
fore  see  that  this  approach  is  intended  to  combine  the 
advantages  of  a  stochastic  approximation  type  of  algo¬ 
rithm  with  the  ability  to  obtain  sensitivity  estimates 
with  respect  to  discrete  decision  variables.  This  com¬ 
bination  leads  to  very  fast  convergence  to  the  optimal 
point,  as  illustrated  in  Section  6.  It  appears,  therefore, 
feasible  to  apply  this  approach  to  problems  with  local 
extremal  points  by  developing  a  procedure  that  allows 
the  algorithm  to  operate  from  multiple  initial  states  in 
an  effort  to  determine  a  global  optimum. 

Two  issues  that  are  the  subject  of  ongoing  research  are 
(a)  application  of  this  approach  to  systems  with  dif¬ 
ferent  types  of  constraint  sets,  other  than  the  capacity 
constraint  in  (6) ,  and  (b)  the  effect  of  using  the  Simplex 
method  to  determine  the  set  A/a r-q(p)  on  the  gradient 
estimation. 
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ABSTRACT 

Simulation  of  large  complex  systems  for  the  purpose  of  evaluating  performance  and  exploring  alternatives  is  a  computationally  slow 
process,  currently  still  out  of  the  domain  of  real-time  applications.  To  overcome  this  limitation,  one  approach  is  to  obtain  a 
"metamoder  of  the  system,  i.e.,  a  “surrogate”  model  which  is  computationally  much  faster  than  the  simulator  and  yet  is  just  as 
accurate.  We  describe  the  use  of  Neural  Networks  (NN)  as  metamodeling  devices  which  may  be  trained  to  mimic  the  input-output 
behavior  of  a  simulation  model.  In  the  case  of  Discrete  Event  System  (DES)  models,  the  process  of  collecting  the  simulation  data 
needed  to  obtain  a  metamodel  can  also  be  significantly  enhanced  through  Concurrent  Estimation  techniques  which  enable  the  extraction 
of  information  from  a  single  simulation  that  would  otherwise  require  multiple  repeated  simulations.  We  will  present  applications  of 
two  benchmark  problems  in  the  C3I  domain:  A  tactical  electronic  reconnaissance  model  describing  the  flight  of  a  reconnaissance 
aircraft  carrying  a  bearing  angle  measuring  sensor  over  a  radar  field  in  order  to  detect  ground-based  radar  sites;  and  an  aircraft  refueling 
and  maintenance  system  as  a  component  of  a  typical  Air  Tasking  Order  (ATO).  A  comparative  analysis  with  alternative  metamodeling 
approaches  indicates  that  a  NN  captures  significant  nonlinearities  in  the  behavior  of  complex  systems  that  may  otherwise  not  be 
accurately  modeled. 

Keywords:  Metamodeling,  neural  networks,  concurrent  estimation,  discrete  event  systems,  decision  making 

1.  INTRODUCTION 

Simulation  is  widely  recognized  as  one  of  the  most  versatile  and  general-purpose  tools  available  today  for  modeling  complex  processes 
and  solving  problems  in  design,  performance  evaluation,  decision  making,  and  planning.  This  includes  C3I  environments,  where  most 
problems  confronted  by  designers  and  decision  makers  are  of  such  complexity  that  their  analysis  and  solution  far  surpass  the  scope  of 
available  analytical  and  numerical  methods;  this  leaves  simulation  as  the  only  alternative  of  “universal”  applicability.  The  ultimate 
purpose  of  simulation  is  often  system  performance  evaluation  and  optimization.  Typically,  this  involves  the  use  of  simulation  to 
explore  a  multitude  of  “what  if’  scenaria.  However,  simulation  is  notoriously  computer  time-consuming  when  it  comes  to  parametric 
studies  of  system  performance.  Unless  substantial  speedup  of  the  performance  evaluation  process  can  be  achieved,  systematic 
performance  studies  of  most  real-world  problems  are  beyond  reach,  even  with  supercomputers. 

One  way  to  achieve  a  speedup  is  through  metamodeling.  The  main  idea  of  metamodeling  is  to  extract  as  much  information  from 
simulation  as  possible  and  use  it  to  build  a  surrogate  model  of  the  system  of  interest  which  is  much  simpler  (yet  accurate)  to  work 
with.  This  is  essentially  analogous  to  constructing  a  function  F(jt,  from  observations  of  the  simulation  output  under  a  few 

selected  combinations  of  simulation  inputs  jc,  The  problem,  of  course,  is  that  the  actual  function  we  are  trying  to  approximate 

with  Flx^.-.jc^)  is  unknown.  The  most  common  approach  is  to  try  and  build  a  polynomial  expression.  This  is  often  inadequate 
because  if  the  actual  curve  includes  sudden  jumps  and  asymptotic  behavior  (which  is  often  the  case  in  practice),  then  polynomial  fits  to 
such  curves  are  known  to  be  poor.  Thus,  obtaining  a  metamodeling  device  of  “universal”  applicability,  i.e.,  one  capable  of  generating 
functions  of  virtually  arbitrary  complexity,  is  needed.  This  paper  explores  neural  networks  as  offering  this  capability  and  is  intended  to 
investigate  the  advantages  and  limitations  of  this  approach. 

Training  neural  networks  for  function  approximation  tasks  typically  requires  a  large  number  of  training  pairs,  each  under  a  different 
scenario.  The  obvious  way  to  collect  N  training  pairs  is  to  perform  N  separate  simulations:  one  for  each  scenario.  If  a  typical 
simulation  run  takes  T  time  units,  this  procedure  requires  a  total  of  NT  time  units.  For  Discrete  Event  System  (DES)  simulation 
models,  concurrent  estimation  can  be  used.  In  concurrent  estimation,  the  objective  of  collecting  N  training  pairs  is  met  by  performing 


(1) 
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a  single  baseline  simulation,  but  endowing  it  with  the  capability  to  generate  AM  additional  simulations  concurrently.  This  is 
accomplished  by  exploiting  “intelligent  data  sharing”  which  results  in  a  total  simulation  time  of  (T+c)  «  NT,  where  c  represents  the 
overhead  corresponding  to  this  data  sharing.  Using  concurrent  estimation,  therefore,  reduces  the  simulation  time  needed  to  collect  the 
data  needed  to  train  a  neural  network  metamodel. 

The  remainder  of  this  paper  is  organized  as  follows.  Since  our  purpose  is  to  compare  polynomial  metamodels  to  neural  network 
metamodels,  Section  2  describes  the  basics  of  polynomial  and  neural  network  curve  fitting.  Section  2  also  briefly  reviews  concurren 
estimation  as  a  way  to  speed  the  process  of  collecting  the  simulation  data  needed  to  construct  metamodels.  Section  3  describes  the  to 
simulators  that  will  be  used  to  compare  the  two  metamodeling  approaches.  Section  4  presents  numerical  comparison  results.  The 
paper  ends  in  Section  5  with  a  conclusion  and  a  discussion  of  open  issues  and  ongoing  work. 

2.  NEURAL  NETWORKS  FOR  METAMODELING 

For  applications  involving  simulation,  the  time  required  to  run  the  simulation  may  be  very  long  or  it  may  be  nKtss^l0  J™ 
many  simulation  runs.  Metamodeling  addresses  these  issues:  If  one  can  develop  a  metamodel  that  captures  the  functional  input/ou  p 
relationships  embodied  by  the  simulator,  then  it  is  possible  to  obtain  the  simulation  data  quickly  and  efficiently. 


2.1.  Polynomial  Metamodels  . 

Traditionally,  polynomials  have  been  used  as  metamodels.  The  functional  mapping  represented  by  a  polynomial  is  determined  by  its 
order  and  the  values  of  its  coefficients.  As  an  example,  consider  the  2-input,  1-output,  second  order  polynomia 
v-b+bx+bx+bx  x  +  b>x]  +  b.xl  and  suppose  it  is  to  be  used  to  approximate  some  unknown  function  y  =  /(•*  i  .*2)-  1  ne  least 
squares  approach  gives  a  way  to  determine  the  coefficients  which  will  minimize  the  mean  squared  error  between  the  output  of  the 
polynomial  and  the  output  of  the  function  being  approximated.  In  the  above  example,  there  are  6  coefficients  to  determine.  Thus,  a 
least  6  experiments  must  be  conducted.  Suppose  data  has  been  collected  for  N  a  6  such  experiments.  Index  each  expenmen  as  n 
0  N  and  designate  the  inputs  for  the  n-th  experiment  as  x,(n)  and  x2(n),  the  output  of  the  polynomial  as  y(n) ,  and  the  target  output 

of"  the  function  being  approximated  as  y(n).  Next  define  F  =  [*(»).  K2) . 9(N)]  as  the  vector  of  pdynomia1  output, 

Y  =  [yfll  y(2),...,y(N)]T  as  the  vector  of  target  outputs,  E=Y-Y  as  the  error  between  the  target  output  and  the  polynomial  output, 
and  B  =  \b  b  ,  b  f  as  the  vector  of  polynomial  coefficients  to  be  determined.  The  objective  of  the  least  squares  approach  then  is  to 
determine  the  values  of  the  coefficients  which  will  minimize  the  mean  squared  error,  i.e.,  min[£  E]  .  The  solution  to  this  problem 
is  given  by  B’  =  (XrX)''  XTY  where  X  is  the  regressor  matrix  defined  by, 

n  *,o)  *,owi)  *jo) 


X  = 


1  Jt.(2)  JC,(2)  x.(2K(2)  *?(2)  xU  2) 


1  jc.(Af)  x2(N)  xXN)x2{N)  x^N)  xl(N)  J 

Note,  the  determination  of  the  coefficients  B  requires  inverting  the  matrix  ( XTX ).  The  size  of  this  matrix  is  determined  by  the 
number  of  coefficients  which  is  given  by  the  number  of  inputs  and  the  order  of  the  polynomial.  Since  it  can  be  difficult  to  invert  large 
matrices  the  least  squares  approach  does  not  scale  well  to  problems  with  many  inputs  and  high  order  polynomials.  In  addition,  the  A 
inputs  points  cannot  be  chosen  haphazardly  lest  the  matrix  be  poorly  conditioned  or  ,  even  worse,  singular. 


As  with  all  function  approximation  methods,  the  quality  of  the  approximator  depends  on  the  particular  set  of  N  input  points  used  to 
obtain  the  coefficients.  Since,  it  is  assumed  that  little  is  known  about  the  function  being  approximated,  selecting  the  best  set  of  input 
points  can  be  difficult.  Strategies  for  choosing  input  points  fall  under  the  general  heading  of  experimental  design.  For  first  and  second 
order  polynomials  there  are  well  established  experimental  designs  for  choosing  the  input  points.  For  first  order  polynomials,  the  most 
popular  are  the  orthogonal  designs,  so  called  because  they  result  in  a  diagonal  (X  X)  2.  These  designs  are  useful  because  they 
minimize  the  variance  of  the  coefficients  when  the  function  being  approximated  is  stochastic.  For  second  order  polynomials,  the  most 
common  design  is  the  central  composite  design  (CCD)2.  For  higher  order  polynomials,  design  methods  are  not  as  well  established, 
but  one  common  approach  is  to  use  layered  CCD’s. 


2.2.  Neural  Network  Metamodels 

It  is  well  known  that  multilayer  feedforward  neural  networks  are  universal  function  approximators3.  As  such,  they  are  often  used  for 
non-parametric  modeling,  and  are  well  suited  to  the  task  of  metamodeling.  Fig.  2.1  depicts  the  architecture  of  a  typical  multilayer 
feedforward  neural  network.  In  this  particular  figure  there  are  four  layers,  including  one  input  layer,  one  output  layer,  and  two 
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intermediate  layers  called  the  "hidden  layers."  Each  layer  consists  of  many  nonlinear  devices  called  “neurons."  The  output  of  each 
neuron,  V,,  is  called  its  "activation”,  and  is  a  nonlinear  function  of  a  weighted  sum  of  the  inputs  to  the  neuron  minus  a  threshold,  Q„ 

activation  =  g{^  ;  weights  x  inputs  -  threshold j 

The  form  of  the  function  g(  )  above  is  very  important  to  the  operation  of  a  neural  network.  It  is  usually  chosen  to  have  the  form  of 
g(x)  -  tanh{/3x)  and  is  called  a  “sigmoid  function.”  As  seen  in  the  figure,  the  neurons  are  connected  through  links  with  different 
coefficients  associated  with  each  link.  These  coefficients  are  the  “weights”  in  the  above  equation.  It  can  be  shown  that  if  there  are 
enough  neurons  in  the  hidden  layers  then  there  exists  a  set  of  weights  that  can  approximate  any  function  with  a  finite  number  of 
discontinuities  to  any  desired  degree  of  accuracy3. 


Arelhiiteetorii 

Figure  2.1.  Architecture  of  a  typical  multilayer  feedforward  neural  network. 


Typically,  the  weights  are  adjusted  using  a  “training”  algorithm.  The  usual  training  objective  is  to  adjust  the  weights  to  minimize  the 
mean  squared  error  between  a  desired  target  output  and  the  network  output  for  a  specific  set  of  inputs.  To  do  this,  a  special  algorithm 
known  as  the  “backpropagation  algorithm”  is  often  used.  The  backpropagation  algorithm,  although  quite  ingenious,  is  really  nothing 
more  than  an  application  of  the  chain  rule  of  calculus  to  effect  a  gradient  descent  in  the  weight  space.  For  a  representative  description 
of  the  algorithm  see4. 


Figure  2.2.  Neural  network  training  through  simulation  to  create  a  surrogate  model. 


The  process  of  constructing  a  metamodel  using  a  neural  network  is  illustrated  in  Fig.  2.2.  First,  a  large-scale  simulation  is  executed. 
While  the  simulator  is  running,  the  neural  network  observes  the  simulator  inputs  and  generates  its  own  prediction  of  what  the 
simulator  output  will  be.  It  compares  its  output  to  the  simulator  output  and  uses  the  error  to  adjust  its  weights.  In  this  way  the 
neural  network  "learns”  the  input-output  behavior  of  the  simulator.  One  can  visualize  the  neural  network  as  an  “entity”  which  is 
highly  "intelligent”  but  has  no  knowledge  of  anything  initially  to  apply  its  intelligence  to.  As  it  observes  the  simulation  unfold, 
however,  it  learns  from  the  basic  cause-effect  (i.e„  input-output)  relationships  it  observes.  In  fact,  all  the  neural  network  does  is  adjust 
its  weights  so  as  to  emulate  the  behavior  it  observes  as  closely  as  possible.  When  the  training  is  done,  the  simulator  can  be  taken 
away.  The  neural  network  is  now  a  surrogate  model:  we  give  it  some  inputs  (as  if  we  were  giving  them  to  the  simulator)  and  it 
immediately  gives  an  output  (as  the  simulator  would). 
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One  difficulty  when  using  neural  networks  is  choosing  the  number  of  neurons  to  use  in  the  hidden  layers.  Since  neural  networks 
approximate  functions  by  combining  sigmoids  (i.e„  hidden  layer  neurons),  the  number  of  hidden  layer  neurons  determines  the  ultimate 
complexity  of  the  functions  the  network  will  be  able  to  approximate.  Too  few  and  the  network  will  underfit  the  training  data,  too 
many  and  it  will  overfit  the  data.  In  either  case,  the  resulting  performance  will  be  less  than  satisfactory.  To  address  this  issue,  we  use 
the  Cascade  Correlation  Neural  Network  (CCNN),  which  is  a  type  of  multilayer  feedforward  neural  network  that  adds  hidden  layer 
neurons  as  it  learns5.  The  CCNN  starts  with  a  single  hidden  layer  neuron  and  adds  more  as  needed  during  training  which  usually  leads, 
not  only  to  faster  learning,  but  improved  generalization  performance.  A  detailed  description  of  the  CCNN  training  algorithm  used  for 
these  metamodeling  studies  can  be  found  in  Cassandras  and  Gong5. 

The  main  advantages  of  neural  networks  over  other  approaches  for  function  approximation  (e.g.,  polynomials  or  rational  functions) 
include:  (i)  Generality  Neural  networks  are  capable  of  approximating  virtually  any  function.  Polynomials  and  rational  functions,  in 
contrast,  have  known  limitations  in  terms  of  their  approximation  capabilities,  (ii)  Scalability  Neural  networks  scale  easily  as  the 
problem  size  and  complexity  increases.  When  the  system  dimension  is  in  the  hundreds  or  thousands,  mathematical  formulae  simply 
become  too  complicated  to  use.  In  fact,  determining  the  coefficients  for  a  polynomial  or  rational  function  can  be  completely  infeasible 
when  the  dimension  is  large,  (iii)  Inherent  Parallelism:  Although  the  number  of  weights  increases  with  the  number  of  inputs  and 
with  the  number  of  hidden  neurons,  the  increased  computational  burden  required  to  train  the  network  can  be  met  by  exploiting  the 
inherent  parallelism  of  neural  networks  (i.e.,  each  neuron  can  be  implemented  on  a  different  processor).  They  are,  therefore,  ideal  for 
modeling  large-scale  systems,  (iv)  Well-suited  for  model  sensitivity  analysis:  It  is  always  possible  to  do  model  sensitivity  analysis 
with  a  trained  neural  network,  as  long  as  the  related  factors  have  been  chosen  as  the  inputs  to  the  neural  network.  Thus,  there  is  an 
excellent  opportunity  here  to  combine  this  metamodeling  approach  with  concurrent/parallel  techniques  to  perform  model  sensitivity 
analysis  on  the  neural  network  surrogate  model. 


2.3.  Concurrent  Estimation 

Constructing  a  neural  network  metamodel  will  usually  require  a  large  number  of  training  points  to  achieve  adequate  performance.  For 
Discrete  Event  System  (DES)  simulations,  a  technique  called  Concurrent  Estimation  can  reduce  the  simulation  time  required  to  collect 
the  necessary  training  data.  The  goal  of  concurrent  estimation  is  the  following:  From  a  single  simulation,  obtain  performance 
results  for  several  different  values  of  the  inputs.  The  main  idea  behind  the  approach  is  to  observe  the  evolution  of  a  single  (nominal) 
sample  path  of  a  DES  as  it  operates  under  some  parameter.  As  the  nominal  sample  path  evolves,  observed  data  (e.g.,  event 
occurrences  and  their  corresponding  occurrence  times)  are  processed  to  concurrently  construct  the  set  of  sample  paths  that  would  have 
resulted  if  the  system  had  operated  under  a  set  of  different  (hypothetical)  parameters.  Using  these  “concurrently  constructed  sample 
paths,  it  is  possible  to  “concurrently  estimate”  the  corresponding  performance  measures.  Thus,  from  a  single  simulation  run  training 
data  can  be  collected  which  would  otherwise  require  multiple  runs  to  collect,  therefore  speeding  up  the  process  of  data  collection. 


Figure  2.3.  Concurrent  Estimation. 


To  explain  the  essentials  of  concurrent  estimation,  consider  a  DES  and  a  finite  discrete  parameter  set  0  =  (0, . Qm),  where  each 

parameter  6J  e  0,  j  =  l,...,m  is  in  general  vector-valued.  Suppose  the  sample  path  generated  by  the  DES  is  a  function  of  the 
parameter  6  ,  and  designate  the  sample  path  generated  under  parameter  0.  by  the  sequence  of  pairs  {e* ,  ] ,  where  k  =  1 ,2,...  is  an 
event-counting  index,  e  is  the  k- th  event,  and  tk  is  the  occurrence  time  of  the  *-th  event.  Now,  assume  that  the  DES  is  operating 
under  0,  and  that  all  events  and  event  times  e\,t[  for  k  =  1,2 .  are  directly  observable.  The  problem,  then,  is  to  use  the 
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observations  of  the  sample  path  to  construct  the  sample  paths  {eJk,tJk  },  k  =  1,2,...,  for  any  6),  j  =  2  ,.,.,/n.  This  problem  of 

concurrently  constructing  multiple  sample  paths  is  referred  to  as  the  sample  path  constructability  problem1 .  Note,  any  sample 
performance  metric  L(  0 )  is  obtained  as  a  function  of  the  corresponding  sample  path  j  1.2,. ..  as  shown  in  Fig.  2.3.  Here 

we  are  careful  to  distinguish  between  L(8),  the  performance  obtained  over  a  specific  sample  path  of  the  system  and  E\L(9) ],  the 
expectation  over  all  possible  sample  paths.  The  solution  to  the  sample  path  constructability  problem,  if  it  exists,  therefore  enables  us 
to  learn  about  the  behavior  of  a  DES  under  all  possible  parameter  values  in  ©from  a  single  "trial”,  i.e.,  a  single  sample  path  obtained 
under  one  parameter  value.  For  DES  in  which  all  event  processes  are  Markovian  (memoryless),  the  Standard  Clock  scheme8,  and 
Augmented  System  Analysis  (ASA)9' 1 0  provide  two  very  efficient  ways  to  obtain  concurrent  estimates.  The  Time  Warping 
Algorithm  (TWA)1  while  not  as  efficient  as  the  other  two,  is  a  general-purpose  scheme  for  DES  with  arbitrary  event  lifetime 
distributions. 

A  measure  of  the  effectiveness  of  a  concurrent  estimation  scheme  is  the  “speedup  factor”  measuring  how  much  faster  one  can  obtain 
performance  information  for  N  concurrently  constructed  sample  paths  compared  to  N  individual  "brute  force”  simulation  experiments. 
Typical  numerical  results  for  ASA  give  speedup  factors  of  the  order  of  100,  while  for  TWA  factors  of  2-20  or  more  are  common7' 1  1 
(the  more  complex  the  event  lifetime  probability  distributions  in  a  model,  the  greater  the  speedup).  It  is  worth  pointing  out  that  if  a 
parallel  processing  environment  is  used,  the  speedup  factor  becomes  much  greater  by  several  orders  of  magnitude. 

3.  TWO  BENCHMARK  SIMULATION  MODELS 

To  evaluate  the  efficacy  of  using  the  CCNN  for  metamodeling  we  performed  numerical  experiments  using  two  different  models,  a 
Tactical  Electronic  Reconnaissance  Simulation  (TERSM)  describing  the  flight  of  a  reconnaissance  aircraft  carrying  a  bearing  angle 
measuring  sensor  over  a  radar  field  in  an  effort  to  locate  ground-based  radar  sites,  and  an  Aircraft  Refueling  and  Maintenance  System 
(ARMS)  which  models  a  component  of  a  typical  Air  Tasking  Order  (ATO). 

3.1.  Tactical  Electronic  Reconnaissance  Simulator  (TERSM) 

TERSM.  developed  by  the  RAND  corporation  for  the  United  States  Air  Force1 2,  is  a  very  complex  simulator  that  models  the  flight  of 
a  reconnaissance  aircraft  carrying  a  bearing  angle  measuring  sensor  over  a  radar  field.  As  shown  in  Fig.  3.1,  the  aircraft  flies  with  a 
fixed  heading  at  a  constant  altitude  and  a  constant  velocity  over  a  battlefield  which  contains  many  ground-based  radar  sites  (emitters) 
As  the  aircraft  flies,  the  sensor  records  bearing  angle  measurements  of  the  emitters  it  detects  and  builds  a  list  containing  the  circular 
error  probable  (CEP)  for  each  one.  The  CEP  is  a  disk  of  such  size  that  the  probability  that  the  emitter  is  inside  the  disk  is  50%.  The 
simulator  was  originally  used  to  compare  competing  sensors  when  making  purchase  decisions.  Recently,  TERSM  has  been  used  to 
develop  and  evaluate  polynomial  metamodels13'14’15.  TERSM,  therefore,  provides  an  excellent  testbed  for  comparing  competing 
metamodeling  approaches. 


Figure  3.1.  TERSM,  start  of  mission  (left),  end  of  mission  (right). 

The  operation  of  TERSM  is  similar  in  some  respects  to  the  familiar  police  scanner  that  one  can  buy  in  an  electronics  store.  Like  a 
police  scanner,  the  sensor  scans  a  set  of  frequency  channels  in  some  preprogrammed  sequence.  It  dwells  on  a  channel  for  a  short  period 
of  time  waiting  for  an  emitter  to  transmit  on  that  channel.  If  an  emitter  is  transmitting,  a  series  of  sophisticated  tests  are  performed  to 
check  if  the  sensor  can  detect  it.  The  emitter  must  be  within  the  line  of  sight  of  the  sensor  and  not  blocked  by  terrain  or  the  curvature 
of  the  earth.  The  emitter  must  be  within  the  field  of  view  of  the  sensor’s  antenna  pattern.  The  signal  received  by  the  sensor  must  be 
strong  enough  to  detect,  but  not  so  strong  that  it  exceeds  the  sensor’s  dynamic  range.  If  all  these  tests  pass,  the  sensor  records  the 
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detection  data  After  a  short  dwell  time,  the  sensor  switches  to  another  channel.  During  the  brief  time  period  required  to  switch  from 
one  channel  to  the  next,  the  sensor  processes  all  the  detections  made  on  the  previous  channel  to  extract  bearing  angle  measurements^ 
The  number  of  detections  that  can  be  processed  during  the  channel  switching  time  is  determined  by  the  channel  capacity  _  TCRSM  Jas 
many  inputs  and  is  capable  of  generating  many  outputs.  The  inputs  concern  the  aircraft  flight  data  ,he  s®nsor  da“ ‘.  J"  .  ^RSM  ^ 
emitter.  The  primary  outputs  are  the  number  of  emitters  detected  and  the  CEP  for  each  one.  Complete  details  describing  TERSM 

be  found  in1  ~. 

3.2.  Aircraft  Refueling  and  Maintenance  System 

Although  TERSM  provides  an  excellent  testbed  to  study  the  feasibility  of  the  CCNN  metamodeling  approach^  lacks  some  of  the 

features  that  constitute  real  challenges  to  metamodeling.  For  one,  TERSM  lacks  significant  randomness.  That  is,  although  the 

number  of  emitters  detected  does  change  as  a  function  of  the  initial  random  number  seed,  the  change  is  not  significant  Fo  another 
TERSM  does  not  exhibit  asymptotic  relationships  which  are  veiy  common  in  practice.  Exposing  such  asymptotic  behaviors  of  en 
requires  very  long  simulation  runs.  Avoiding  long  simulation  runs  is  one  of  the  real  benefits  of  metamodeling  on 

other  hand,  has  a  running  time  of  just  under  a  minute  on  a  modern  workstation.  For  such  a  simulation,  the  benefits  of metam  g 
are  not  so  compelling.  Additionally,  asymptotic  behavior  can  be  very  difficult  for  polynomials  to  capture,  and  provides  one  of 
primary  motivations  for  the  use  of  more  powerful  metamodeling  methods  like  the  CCNN. 

We  have  therefore  concentrated  on  identifying  a  good  benchmark  problem  with  the  above  features,  based  on  the  literature  of  military 
models' 6  - 1 7  In  1 7  for  example,  a  model  is  considered  for  the  process  of  moving  and  reassigning  strategic  airlift  pilots  with  the 
objective  of  managing  and  ultimately  minimizing  moving  costs  while  maintaining  mission  capability  Motivated  by  this  problem 
we  have  concentrated  on  an  Aircraft  Refueling  and  Maintenance  System  (ARMS).  The  basic  ARMS  model  is  shown  in  Fig.  3.2.  A 
illustrated  ARMS  is  a  multiclass  queueing  system.  Jobs  from  each  class  n  =  1....*  arrive  with  average  rates  to  separate  arrival 
queues  with  capacities  C„  (possibly  infinite).  The  system  has  6  tokens.  At  block  1.  the  jobs  compete  for  a  token  on  a  priority  basis 
with  the  class  1  jobs  having  the  highest  priority.  The  job  waiting  in  the  highest  priority  arrival  queue  will  be  the  first  to  get  a  token 
when  one  becomes  available.  After  getting  a  token,  the  jobs  enter  a  service  queue  with  capacity  Cs.  At  block  2  the  jobs  are  selecred 
from  the  service  queue  according  to  some  service  discipline  (e.g.,  first  in  first  out  (FIFO))  and  routed  to  one  of  *  =  1  servers  e 
time  to  service  a  job  is  a  random  variable  n(n,k)  which  can  vary  as  a  function  of  the  job  class  n  and  the  particular  server  k.  Upon 
completing  service,  the  jobs  proceed  to  block  3,  where  the  token  is  returned  to  a  token  pool,  and  the  job  leaves  the  system. 


Figure  3.2.  The  basic  ARMS  model. 

The  basic  ARMS  model  is  very  general  and  can  be  used  to  represent  a  large  variety  of  Air  Force  C3I  operations.  For  example,  such  a 
model  can  be  used  to  represent  computer  networks,  communications  systems,  or  logistics  problems.  The  specific  problem  we  will 
consider  is  the  aircraft  refueling  and  maintenance  system  (ARMS).  The  ARMS  problem  has  aircraft  requesting  to  land  at  a  particular 
site  (airport  or  aircraft  carrier)  for  refueling  and/or  maintenance  purposes.  Depending  on  aircraft  type,  a  priority  is  assigned  to  each 
aircraft  so  that  high-priority  ones  are  served  first.  Since  landing  capacity  and  associated  maintenance  resources  are  limited,  a  specific 
number  of  "permits”  (i.e.,  tokens)  are  available.  An  aircraft  is,  therefore,  forced  to  wait  until  it  receives  a  permit.  Upon  receiving  a 
permit,  the  aircraft  is  guided  to  a  refueling/maintenance  area.  If  the  resources  required  to  complete  the  refuel.ng/maintenance  process 
are  not  immediately  available  (e.g.,  personnel,  tools,  spare  parts,  fuel),  the  aircraft  is  further  delayed.  When  the  aircraft  completes 
service  the  permit  is  returned  to  the  permit  pool,  and  the  aircraft  proceeds  to  take  off  and  return  to  action.  In  studying  this  system, 
one  is  interested  in  minimizing  the  expected  “down  time"  of  an  aircraft,  with  more  emphasis  given  to  certain  types  of  aircraft  (the  ones 
given  higher  priority).  At  the  same  time,  one  is  interested  in  keeping  service  costs  within  acceptable  levels.  From  a  modeling 
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standpoint,  one  must  therefore  determine  functional  relationships  such  as  the  expected  down  time  of  a  priority  1  aircraft  with  respect  to 
factors  such  as  the  number  of  permits;  or  the  number  of  maintenance  resources  allocated  to  the  refueling/maintenance  process. 


Figure  3.3.  System  used  for  metamodeling. 


Our  metamodeling  studies  in  this  paper  focus  on  the  3-class,  single  server  ARMS  model  shown  in  Fig.  3.3.  In  this  model,  the  class  1 
jobs  have  the  highest  priority,  and  the  class  3  jobs  the  lowest.  Job  arrivals  are  Poisson  with  rates  Ay,  where  i  is  the  customer  class. 
In  order  for  a  job  to  be  served,  it  must  have  one  of  9  tokens.  Jobs  with  tokens  queue  up  to  be  served  by  a  single  server.  At  the 
completion  of  service,  the  jobs  leave  the  system,  and  the  tokens  are  returned  to  the  token  pool  for  reuse.  When  different  classes  of 
jobs  are  competing  for  tokens,  the  class  with  the  highest  priority  gets  one  first.  Jobs  in  the  same  class  compete  for  tokens  on  a  first 
come,  first  served  (FCFS)  basis.  The  arrival  queues  have  infinite  capacity,  and  the  server  queue  has  capacity  Cs  =  8.  Jobs  in  the  server 
queue  are  served  FIFO  (with  no  distinction  made  between  jobs  from  different  classes).  Service  is  nonpreemptive  (once  a  job  begins 
service,  service  cannot  be  interrupted,  and  will  continue  until  completion),  and  the  service  time  is  an  exponential  random  variable  with 
parameter^,  =  1,  i  =  1,2,3. 


4.  NUMERICAL  EXPERIMENTS 

This  section  gives  numerical  results  to  compare  polynomial  metamodels  to  CCNN  metamodels  for  TERSM  and  ARMS.  For  the 
results  that  follow,  the  performance  measure  used  is  the  Mean  Squared  Error  (MSE)  between  the  output  of  the  simulator  (TERSM  or 
ARMS)  and  the  corresponding  output  of  the  metamodel  (polynomial  or  CCNN  ).  For  each  input/output  pair  i  =  in  the  data  set, 
the  mean  squared  error  is  given  by, 

MSE  =  -fe] 

n  i=i 

where  e.  is  the  error  or  difference  between  the  output  simulator  and  the  output  of  the  metamodel. 

4.1.  Baseline  TERSM  problem. 

The  baseline  TERSM  problem  focused  on  the  relationship  between  the  number  of  emitters  detected  which  have  a  CEP  less  than  5 
nautical  miles  in  radius  and  the  following  four  inputs;  the  altitude  of  the  aircraft  (it  flies  at  a  constant  altitude  for  the  duration  of  its 
mission),  the  velocity  of  the  aircraft  (it  flies  at  a  constant  velocity  for  the  duration  of  the  mission),  the  azimuth  of  the  sensor  (the 
angle  of  the  sensor  boresight  relative  to  the  aircraft  heading),  and  the  channel  capacity  of  the  sensor  (the  number  of  bearing  angle 
measurements  that  can  be  processed  during  the  channel  switching  time).  This  four  input  single  output  example  (see  Fig.  4.1)  has  been 
used  in  the  literature  to  develop  polynomial  metamodels13'14'15.  Since  it  is  well  understood,  this  TERSM  problem  provided  an 
excellent  baseline  for  our  feasibility  studies. 


Emitters 
with  CEP  <  5 


Figure  4.1.  Baseline  TERSM  problem. 

Using  a  layered  central  composite  design,  Caughlin1  3,  compiled  a  table  of  49  input/output  pairs  from  which  he  obtained  the  following 
reduced  4th  order  polynomial  metamodel, 


Altitude 


Velocity 


Azimuth. 


Channel  Cap 


Table  4.1.  Range  of  interest  for  each  input  variable. 

Input  Variable 

Lower  Limit 

Upper  Limit 

Altitude  (feet) 

5000 

40,000 

Velocity  (knots) 

186 

1150 

Azimuth  (degrees) 

60 

150 

Channel  Capacity 

4 

30 
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Vv  =  22.4331  - 0.0148x,  -2.7822xz  +0.1432x3  +  3.1432x4  +0.3653x,x3  +1.2439x,x4  +0.1483x2x3  +0.4430x2x4 

+0.2698xjX4  +  0.4369x,x2x3  +  0.3286x,x2x4  +  0.0960x,x3x4  -  0.2791x2x3x4  -  0.8326xf  +  0.7642x2  - 1 .841 3x42 

+0.7577x,3  +  4.9038x2  +  1.0924x3  - 1 .1907x1*  -4.8443x2  .  ... 

,  ,  .  ,  •  r  altitude  x-,  is  the  velocity,  x3  is  the  azimuth  angle,  and  x4  is  the  channel  capacity.  All 

;;  ~  -  *.  *  ~ » -*■  -  ™ 

software,  we  were  able  to  duplicate  Caughlin’s  results.  This  served  to  validate  our  version  of  the  TERSM  simulation  software. 

To  obtain  a  CCNN  metamodel,  we  randomly  generated  238  input/output  pairs  using  TERSM,  and  used  them  as s  a ^  the 

summarizes  the  results. 

Table  4  2  MSE  comparison  between  the  polynomial  and  CCNN  for  baseline  TERSM. - 

I  49  point  set  ?38  point  NN  set  165  point  test  set 

- EgMi! - 1| - |gF  1  I  - 

Table  4.2.  ,he  49  point  se, is  the  se,  of  input/.utpu.  pairs  to  were  used  t  Z£g 

,„M  „c,y 40-0/.  The  238  poin.se,  is  U»  set  of  , ^  X 

■«  I"  CCm~  f '  165  point  sells  “to  see  Itow  well  dte  models  generalise  to  data  wbieh  they 

represents  an  independent  testing  set.  The  purpose  ot  the  t  g  f  the  :  ut  space  which  are  outside  the  range 

s 

on  the  testing  set  is  the  most  important  performance  criterion. 

emitter  field  for  a  shorter  period  of  time.  The  azimuth  was  found  to  have  httle  effect,  and  me W tm  .$  easy  l0  fit 

the  faci  thai  the  ccNN  was  able  10 

give  such  good  performance  with  only  19  neurons  in  its  hidden  layer. 

This  prompted  us  to  develop  a  -toughed-  TERSM  proven, with  a  "rougher" a  nd  mom  “t«  ZSZfi 

TERSM  ^p^fQ^the'baseli'ne'T^RslyTprobleni^di^rartpu*  i^thVnumber'o/emihers'detected  which  have  a  CEP  les^'6^ 

72!  ™  s  t,  f1  4  2 he  inputs  rue,  the  initial  airoraf,  coordinates  Xo  and  Td,  the  initial  amoral,  compass  headmg  9  (the 
aircmifflieT  a  constant  heading),  and  the  aircraft  velocity  V  (dte  aircraft  flies  a,  a  constant  velocty).  _ 


Table  4.3.  Range  of  interest  for  each  input  variable. 


Emitters 
with  CEP  <  5 


TERSM 


Input  Variable 
Xo 
Yo 

Heading  (degrees; 
Velocity  (knots) 


Lower  Limit 
0 
0 
0 

186 


per  Limit 
400 
300 
359 
1150 


Figure  4.2.  “Tougher”  TERSM  problem. 
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To  get  some  insight  as  to  why  this  TERSM  problem  is  “tougher"  consider  the  following.  The  TERSM  experiments  were  conducted 
using  an  emitter  set  which  has  2359  emitters  all  located  in  a  rectangular  region  on  the  surface  of  a  sphere.  Imagine  now  a  set  of 
experiments  where  the  initial  location  of  the  aircraft  is  always  to  the  north  of  this  rectangular  region,  and  suppose  that  for  each 
experiment  we  only  change  the  aircraft  heading  (the  aircraft  flies  a  constant  heading  for  the  duration  of  its  mission).  For  a  zero 
heading  angle  (due  north)  we  are  flying  away  from  the  emitter  field  and  no  emitters  are  detected.  As  the  angle  increases  towards  1 80 
degrees  (due  south)  more  and  more  emitters  are  detected  as  we  fly  over  an  increasingly  larger  portion  of  the  emitter  field.  Then  as  the 
angle  increases  beyond  180  degrees,  the  number  of  emitters  detected  decreases.  Thus,  the  tougher  TERSM  problem  has  a  response 
surface  consisting  of  a  collection  of  isolated  “humps”  where  many  emitters  are  detected,  surrounded  by  large  flat  regions  where  no 
emitters  are  detected.  Such  a  response  surface  should  be  quite  difficult  for  a  low  order  polynomial  to  capture.  The  CCNN,  however, 
due  to  its  universal  function  approximation  capabilities,  should  be  able  to  approximate  the  response  surface.  Nevertheless,  such  a 
response  surface  still  poses  difficulties  for  the  CCNN  in  terms  of  data  collection.  If  the  enough  points  are  not  chosen,  it  is  possible  to 
miss  some  the  humps  entirely. 

As  before,  we  performed  simulation  experiments  to  construct  three  data  sets:  a  training  set  for  the  polynomial,  a  randomly  chosen 
training  set  for  the  CCNN,  and  a  randomly  chosen  testing  set  with  which  to  compare  the  two.  The  polynomial  training  set  was 
obtained  using  a  layered  CCD  design  consisting  of  the  49  input/output  pairs.  The  CCNN  training  set  consisted  of  14,641 
input/output  pairs,  and  the  independent  testing  set  consisted  of  500  input/output  pairs. 

Using  the  49  point  set  we  obtained  the  following  4th  order  polynomial  metamodel.  As  for  the  baseline  TERSM  problem,  a  square 
root  nonlinearity  on  the  output  gives  the  best  results: 

yy  =  20.3885  -  4.8499*,  +  7.7436*2  -  2.6979*3  -  2.9343*,  -  1 .295  1*,*2  +  3.6486*,.*,  +  1 .2 1 44*,*,  +  1 ,9655*2*3  -  2.3306*,*, 

+0.0265*.,*,  -  1.0556*, *2*3  - 1.0158*,*2*4  +  1 ,4222*,*3*4  +0.2683*2*3*4  - 1 ,0927*,*2*3*4  - 13.5657*, 2  - 12.1073*2 

+0.6690*2  +3.1 144*2  +6.4287*, 3  -9.1515*2  +  3.3922*2  +5.2131*’  +  4.7740**  +4.6920**  -0.6565*34  -2.9759*44 
In  the  equation  above,  *,  is  the  initial  aircraft  x-location,  *2  is  the  initial  aircraft  y -location,  *3  is  the  aircraft  heading  (0  degrees  is  due 
north),  and  *4  is  the  aircraft  velocity.  All  inputs  are  centered  and  scaled,  and  the  output  y  is  the  number  of  emitters  located  with  a  CEP 
less  than  5  nautical  miles. 

Comparative  results  are  shown  in  Table  4.4.  The  ruggedness  of  the  response  surface  resulted  in  the  polynomial  giving  very  poor 
performance.  Even  on  the  49  point  set,  the  one  used  to  obtain  the  polynomial  coefficients,  the  MSE  was  37,278.  That  is,  on  the 
average,  the  output  of  the  polynomial  and  the  output  of  TERSM  differed  by  193  emitters,  a  huge  difference  when  one  considers  that 
the  number  of  emitters  detected  with  a  CEP  <  5  nautical  miles  averaged  only  243  and  never  exceeded  849.  The  CCNN  gives  much 
better  performance  following  a  rigorous  training  process  consisting  of  many  more  training  points. 


Table  4.4.  MSE  comparison  between  the  polynomial  and  CCNN  for  “tougher"  TERSM. 

49  point  set 

14,641  point  NN  set 

500  point  test  set 

polynomial 

37,278 

66,035 

61,638 

CCNN 

5,243 

4,379 

4,671 

4.3.  Baseline  ARMS  problem. 

Next  we  present  metamodeling  results  using  the  ARMS  model  from  Fig.  4.3  for  the  range  of  inputs  in  Table  4.5.  The  inputs  A,  are 
the  arrival  rates  for  the  three  customer  classes,  8  is  the  number  of  tokens,  and  the  output,  J  =  r, ,  is  the  service  time  of  the  class  1 
(highest  priority)  customers.  The  service  time  is  the  time  from  the  customer  arrival  at  the  system  to  the  time  the  customer  finishes 
service  and  departs  the  system.  In  the  context  of  aircraft  refueling  and  maintenance,  this  is  the  “downtime”  or  amount  of  time  that  an 
aircraft  is  not  available  for  service. 

From  the  preliminary  analysis  in  Cassandras  and  Gong6  we  already  have  some  intuition  about  how  we  can  expect  ARMS  to  behave. 
When  the  number  of  tokens  8  is  small,  the  system  will  appear  to  the  class  1  customers  as  if  they  are  the  only  ones  using  the  system. 
This  is  because,  regardless  of  the  arrival  rates  of  the  lower  priority  customers,  if  there  are  customers  waiting  in  the  class  1  arrival 
queue,  then  those  customers  will  get  the  tokens  as  they  become  available.  When  there  are  not  many  tokens,  the  queue  at  the  server 
will  be  short,  and  the  customer  will  pass  quickly  through  the  system.  As  the  arrival  rate  of  the  class  1  customers,  A],  increases 
towards  the  service  capability  of  the  server,  l//i,  arriving  class  1  customers  will  tend  to  find  long  queues  in  the  arrival  queue,  their 
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wauine  lime  in  the  arrival  queue  will  be  long,  and  thus  their  service  times  will  increase.  It  is  only  when  the  arrival  rate  of  the  c  ass 
customers  is  relatively  low  the  number  of  tokens  is  large,  and  the  arrival  rate  of  the  other  classes  of  customers  is  high  that  the  class  1 
customers  will  feel  the  effects  of  the  lower  priority  customers.  This  is  because  when  the  arrival  rate  of  the  class  1  customers  is  v., 
the  class  1  arrival  queue  will  often  be  empty,  and  lower  class  customers  will  be  able  to  get  tokens  more  often.  As  a  result  when  a 
c  ss  1  customer  drives  and  gets  a  token,  it  will  find  many  other  customers  ahead  of  it  m  the  server  queue.  Thus,  although  the 
waiting  time  in  the  arrival  queue  will  be  low,  the  waiting  time  in  the  server  queue  will  be  longer  because  there  ,s  no  priority  there. 
Thus  we  get  a  response  surface  that  increases  asymptotically  with  X]  and  monotomcally  with  ft 


Fig.  4.3.  Baseline  ARMS  problem. 


Table  4.5.  Range  of  interest  for  each  input  variable. 

Input  Variable  Lower  Limit  Upper  Limit 
A i  (customers/sec) _ 0_1 _ '  -0 

A2 _ 0J _ io 

At  0.1 _ 10 

- e  i  20 


Follow, ng  our  usual  approach,  we  used  a  layered  CCD  design  to  collect  49  training  points  for  a  polynomial  i C  49  oo^s  eave^he 
selected  points  to  train  a  CCNN  metamodel,  and  500  randomly  selected  points  for  comparing  the  two  models ’  p ^"“s  the  best 
following  4th  order  polynomial  metamodel.  As  with  all  previous  problems,  a  square  root  transformation  on  the  output  gives 

v y  =  3.6325  -0.1 023x,  -  0. 1 238x2  -  0.0383*,  + 1 .3530x4  -  0.01 29x,x2  -  0. 1 893x,x3  +  0. 1437x,*4  -  0.5784  x2*3  +  0.2346x2*4 
+0.2536x3x4  -0.1399*, *2*3  -0.1262x,*2*4  -  0.0960*, *3*4  -0.4492x2*3*4  -  0.0848x,x2x3*4  -0.41 17*?  +  0.1263*’ 

-0.44 1 6xl  -  0.0308*’  + 1 .7967*’  +  0.5235x23  +  0.2576*’  +  0.0052x4’  + 1 .9208**  -  0.2488**  +  0. 1 862*,*  -  0.4665*44 
Table  4.6  summarizes  our  comparative  results  for  the  baseline  ARMS  problem. 

_ Table  4,6.  Comparison  between  the  polynomial  and  CCNN  for  baseline  ARMS. - 

|  49  point  set  [  500  point  NN  set  I  500  point  test  set  __ 


olvnomial 


CCNN  I 


CCNN  II 


17.74  (34.3% 


13.01  (22.9% 


10.57  (16.3% 


In  Table  4  6  CCNN  I  was  trained  to  learn  a  scaled  version  of  the  output,  0.17,  and  CCNN  II  was  trained  to  learn  the  natural  log  of  the 
output,  log (7).  Because  of  the  asymptotic  behavior  of  ARMS,  it  was  felt  that  the  log  would  “flatten”  the  output  < curve  m aking  the 
learning  task  easier.  Interestingly,  the  performance  of  the  two  networks  is  almost  identica  as  was  the  final  number  of  hldden  >a 
neurons  in  each-  25  for  CCNN  I  and  23  for  CCNN  II.  In  comparing  the  CCNN  to  the  polynomial  metamodel,  performance  on 
500  point  test  set  is  really  the  most  important  since  it  provides  a  measure  of  the  generaliMtion  capability  over  a  representative  samp  e 
of  the  input  space  (unlike  the  49  point  set  which  covers  the  extreme  cases).  In  terms  of  the  MSE  cr.tenon,  both  the  CCNN  an 
polynomial  appear  to  generalize  well.  When  the  range  of  the  output  ,s  small,  as  it  is  m  this  case  however,  MSE 
misleading.  A  better  performance  measure  to  use  in  such  cases  is  the  Mean  Squared  Relative  Error  (MSRE),  given  by, 


r=-yf-T 


where  e,  is  the  error  between  the  target  and  network  output  and  y,-  is  the  target  output.  This  criterion  (shown  in  parenthesis  in  Table 
4.6)  clearly  demonstrates  the  superiority  of  the  CCNN  with  the  log  transformation. 

4.4.  Concurrent  simulation  to  speed  data  collection  in  the  ARMS  model 

For  DES  like  the  ARMS  model,  concurrent  estimation  can  be  used  to  speed  the  process  of  collecting  the  input/output  pairs  needed  to 
obtain  a  metamodel.  One  of  the  parameters  that  affects  the  system  time  of  the  class  1  customers  in  the  ARMS  model  is  the  number  o 
tokens  ft  As  described  previously,  changing  the  number  of  tokens  changes  their  resulting  system  times,  with  a  smaller  number  of 
tokens  favoring  the  higher  priority  customers  (see  Fig.  4.4). 
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Since  analytical  performance  measures  for  this  system  are  hard  to  derive,  in  order  to  determine  the  function  in  Fig.  4.4  it  is  necessary 
to  resort  to  simulation  with  at  least  one  simulation  for  each  different  value  of  6,.  As  described  in  Section  2.3,  concurrent  estimation 
techniques  allow  one  the  obtain  performance  estimates  under  any  number  of  tokens  6\...9N  by  observing  only  a  single  sample  path 
under  some  6b  tokens.  Generally,  the  amount  of  time  to  collect  these  N+ 1  performance  estimates  will  be  much  less  than  the  time 
required  to  run  N+\  individual  “brute  force”  simulation  experiments. 


Tokens 


Figure  4.4.  Effect  of  changing  the  number  of  tokens  (  0)on  the  system  time. 

To  illustrate  just  how  significant  this  speedup  can  be,  consider  the  ARMS  model  from  Fig.  3.3.  Assume  that  all  arrival  queues  have 
infinite  capacity,  all  arrival  processes  are  Poisson,  and  all  customer  classes  have  identical  (exponentially  distributed)  service  time 
requirements.  Under  these  conditions,  Fig.  4.5  shows  the  speedup  that  can  be  achieved  using  some  recent  extensions  to  the  TWA 
scheme  mentioned  in  Section  2.3.  In  the  figure,  speedup_l  corresponds  to  the  average  speedup  achieved  over  all  concurrently 
constructed  sample  paths.  Speedup_2  corresponds  to  the  marginal  speedup  when  going  from  n  to  n+1  parallel  sample  paths. 


Nuirfrer  of  Parallels  vnple Paths 

Figure  4.5.  Speedup  factor  for  concurrent  estimation. 


5.  CONCLUSIONS 

Simulation  is  replacing  actual  experiment  for  reasons  of  cost,  safety,  and  practicality.  One  application  area  where  simulation  is  seeing 
an  increased  use  is  in  design  and  decision  making  forC3I  systems.  Such  issues  typically  require  simulation  to  answer  a  multitude  of 
"what  if'  questions.  The  difficulty  is  that  complex  simulations  can  be  very  time  consuming  even  on  modern  supercomputers.  What 
is  needed,  therefore,  is  a  way  to  speed  the  processes  of  evaluating  the  outcome  of  competing  “what  if’  scenarios.  This  is  the 
motivation  behind  the  idea  of  metamodeling — building  a  simpler  higher-level  model  of  a  simulation  model.  This  is  a  function 
approximation  task,  i.e.,  replacing  the  simulation  with  some  function  which  has  the  same  input/output  behavior  as  the  simulator. 
The  benefit  of  such  an  approach  is  that  evaluating  the  function  is  orders  of  magnitude  faster  than  running  the  simulation,  allowing  for 
real-time  decision  making.  Traditionally,  polynomials  have  been  used  for  metamodeling.  In  this  paper,  we  proposed  neural  networks 
as  an  alternative  metamodeling  device  in  order  to  exploit  their  “universal  function"  approximation  capability  and  their  superior 
scalability  to  complex  function  approximation  problems  which  have  a  large  number  of  inputs  and  outputs.  Polynomials,  on  the  other 
hand,  have  known  limitations  in  terms  of  their  function  approximation  power,  and  they  do  not  scale  well  as  the  problem  size  and 
complexity  increases.  In  comparing  polynomials  to  neural  networks,  we  saw  that  constructing  polynomial  metamodels  requires  a 
great  deal  of  technical  sophistication  on  the  part  of  the  metamodeling  practitioner.  First,  one  has  construct  an  experimental  design  to 
decide  what  points  to  collect  for  obtaining  the  coefficients  of  the  polynomial.  Second,  one  must  examine  the  residual  errors  to  decide 
when  certain  terms  are  not  significant,  in  which  case  their  coefficients  are  set  to  zero.  Third,  for  many  polynomial  curve  fitting 
problems,  it  is  advantageous  to  use  a  nonlinear  transformation,  such  as  the  square  root,  on  the  output.  In  contrast,  constructing 
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metamodels  using  neural  networks  is  easier  as  it  simply  involves  randomly  selecting  a  set  of  training  points  and  presenting  them  to  a 
neural  network  training  algorithm.  Our  numerical  performance  results  demonstrate  the  superiority  of  the  CCNN  over  polynomials  in 
terms  of  generalization  capability  on  several  nontrivial  metamodeling  examples. 

Neural  networks,  however,  are  not  a  panacea  as  they  also  involve  some  challenging  issues.  The  first  issue  involves  the  determination 
of  the  network  size  (i.e.,  the  number  of  neurons  to  use  in  the  hidden  layer).  Too  few  and  the  neural  network  will  underfit,  too  many 
and  the  neural  network  will  overfit.  To  deal  with  this  issue  we  chose  the  Cascade  Correlation  Neural  Network  (CCNN)  architecture 
because  it  automatically  decides  how  many  hidden  neurons  are  necessary  during  learning.  Second,  neural  network  training  algorithms 
can  be  very  slow  to  converge.  The  CCNN  also  addresses  this  difficulty,  and  uses  a  training  algorithm  that  is  generally  faster  than 
methods  such  as  the  usual  error  backproprogation  algorithm.  A  third  difficulty  is  the  number  of  input/output  pairs  needed  to  train  the 
neural  network.  For  DES  simulations  we  demonstrated  concurrent  estimation  as  a  way  to  collect  several  input/output  pairs  from  a 
single  simulation  run  in  much  less  time  than  it  would  take  to  collect  these  same  points  via  individual  simulation  experiments  In 
addition  to  concurrent  estimation,  what  is  needed  is  some  “adaptive”  data  collection  strategy  which  can  use  the  history  of  the  points 
already  sampled  to  decide  where  to  sample  next.  The  objective  of  such  a  strategy  is  to  locate  the  “bumps”  in  the  output  function  so  as 
to  take  few  samples  where  the  function  is  “fiat”  and  more  samples  from  the  “bumps.”  One  idea  here  is  to  use  perturbation  analysis 
techniques  for  DES 1 8  • 1 9  to  extract  derivative  information  from  each  simulation  run.  One  can  then  use  these  derivative  estimates  as 
local  “roughness”  measures  to  decide  where  to  sample  next.  This  sort  of  “adaptive  intelligent  experimental  design”  remains  an  open 
issue  and  a  subject  of  our  ongoing  research. 
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